How is it possible to create a clustered index on a view - sql-server

How is it possible to create clustered indexes on a view in SQL Server 2008?
A view is not a real table, so there seems to be no sense in the physical arrangement of data that a clustered index creates.
Where am I missing the point?

An index always exists on disk. When you create the index, you are materialising the rows of the view on disk, even though the view itself has no "real" rows of its own.
MSDN White paper with an explanation

This is a somewhat simplified explanation. There's lots of technical hoo-hah going on under the hood, but it sounded like you wanted a general "wassup" explanation.
A view is, essentially, a pre-written and stored query; whenever you access the view, you're retrieving and plugging that pre-written query into your current query. (Leastways this is how I think of it.)
So these "basic" views read data that's stored in tables already present in the database/on the hard drive. When you build a clustered index on a view, what you are really doing is making a second physical copy of the data that is referenced by the view. For example, if you have table A, create view vA as "select * from A", and then build a clustered index on that view, what you end up with is two copies of the data on the hard drive.
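For illustration, a minimal sketch of that scenario with a hypothetical table dbo.A (note that a real indexed view must be schema-bound and name its columns explicitly; SELECT * is not actually allowed):
CREATE TABLE dbo.A (id int NOT NULL PRIMARY KEY, Status int NOT NULL, Payload varchar(100) NOT NULL);
GO
-- The view must be created WITH SCHEMABINDING and list its columns (no SELECT *)
CREATE VIEW dbo.vA WITH SCHEMABINDING AS
    SELECT id, Status, Payload FROM dbo.A;
GO
-- Building the unique clustered index materializes the view's rows:
-- a second physical copy of the data on disk
CREATE UNIQUE CLUSTERED INDEX IX_vA ON dbo.vA (id);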
This can be useful if table A is very large and you want quick access to a small subset of it (such as only 2-3 columns, or only rows where Status = 1), or quick access to data that requires an ugly join to produce.
The fun comes in when you update table A (really, any of the tables referenced by the view), as any changes to the "base" table must also be made to the "view" table. Not a good idea in heavily used OLTP systems.
FYI, I believe SQL's "Indexed Views" are called "Materialized Views" in Oracle. For my money, Materialized View is a much better name/description.

Though a view is not a real object, the clustered index is.
And the rows the view returns can be sorted and stored.
However, to be indexable, the view must satisfy a number of conditions.
Mostly they make sure that the results can be persisted and that updates to the underlying tables can be easily tracked in the view (so that the index does not have to be rebuilt from scratch each time an underlying table is updated).
For instance, SUM and COUNT_BIG are distributive functions:
SUM(set1) + SUM(set2) = SUM(set1 + set2)
COUNT_BIG(set1) + COUNT_BIG(set2) = COUNT_BIG(set1 + set2)
so it's easy to recalculate the values of SUM and COUNT_BIG when the table is changed, using only the view rows and the values of the affected columns.
However, it's not the case with other aggregates, so they are not allowed in an indexed view.
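As a sketch of what this looks like in practice (table and column names are hypothetical): an indexed view with a GROUP BY must include COUNT_BIG(*), and SUM is only allowed over non-nullable expressions:
CREATE VIEW dbo.vSalesTotals WITH SCHEMABINDING AS
    SELECT CustomerID,
           SUM(Amount)  AS TotalAmount,  -- distributive: maintainable incrementally
           COUNT_BIG(*) AS RowCnt        -- mandatory in an indexed view with GROUP BY
    FROM dbo.Sales
    GROUP BY CustomerID;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vSalesTotals ON dbo.vSalesTotals (CustomerID);
-- AVG, MIN and MAX are rejected: after a delete, for example, a new MIN cannot be
-- derived from the old MIN without rescanning the base rows.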

Related

What is the scope of a temporary index on a permanent table?

Within a stored procedure, I created this index:
CREATE NONCLUSTERED INDEX #IX_MyTempIndex ON dbo.MyPermTable (ColumnA, ColumnB) INCLUDE (ColumnC);
Days later, from a different session, a different user got the error "...an index or statistics with name '#IX_MyTempIndex' already exists on table 'dbo.MyPermTable'."
1) Is this the correct way to specify a temporary index on a permanent table?
2) What event or scope will cause the temporary index to disappear?
There is no such thing as a "Temporary Index".
You can make a temp table with an index, and by virtue of the table being temporary the index will be too, but that's not the same as what you're describing.
If you were allowed to make the index, why not keep it if it is necessary for your query? Simply evaluate it and make sure it is a good index for your table. You don't want an additional index that is nearly identical to an existing one except for one extra column, or some other inefficient arrangement.
At this point you need to ask yourself some serious questions about the query you're running:
Are you aggregating items in this table, and only this table?
Are you joining to other tables? How many? Are they indexed properly?
How often is this table updated, deleted from, inserted into, etc?
How often is my procedure run?
Given the answers to these, and possibly other questions, you'll know whether you should in fact have an index on the table, or whether you should be creating a temp table or view to work on in your procedure. In either case, you do not want to create an index, do some work, and then drop the index. You'll lose more than you gain.
As an example, if you're doing some aggregations on values only in this table, and they take a while, it may be beneficial to simply copy the whole table into a view or temporary table. This will release the base table from your locks faster than running the aggregations against it directly; if not, just do your work on the base table.
If you will use it over and over, use a view: you won't have to recreate it each time, and it will be up to date when you run your sproc. If performing your aggregations on the clone is still slow, you can put indexes on the view or temp table.
If your sproc requires joins, you should probably be indexing the tables involved. Otherwise, no matter what you do with one table, the unoptimized table(s) involved will eventually drag you down.
Creating indexes on a permanent table through a stored procedure is not recommended.
Such an index is created on your permanent table and stays there until you drop it.
If you create an index on a temporary table, it will be dropped when the session ends.
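A short sketch of the contrast, using dbo.MyPermTable from the question (the temp-table names are hypothetical):
-- Index on a permanent table: persists until you DROP INDEX it,
-- regardless of whether the CREATE INDEX ran inside a procedure.
CREATE NONCLUSTERED INDEX IX_MyPermTable_AB
    ON dbo.MyPermTable (ColumnA, ColumnB) INCLUDE (ColumnC);

-- Index on a temp table: dropped automatically with the table at session end.
CREATE TABLE #Work (ColumnA int NOT NULL, ColumnB int NOT NULL);
CREATE NONCLUSTERED INDEX IX_Work_AB ON #Work (ColumnA, ColumnB);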

Is there any benefit to indexing the base tables of an indexed view?

After I created the indexed view, I tried disabling all the indexes in the base tables, including the indexes on the foreign key columns (the constraints are still there), and the query plan for the view stayed the same.
It is just like magic to me that the indexed view can speed up the query so much even with the base tables unindexed. Even without any further index on the view, SQL Server is able to do an index scan on the primary key index of the indexed view to retrieve data something like 1000 times faster than using the base table.
Something like SELECT * FROM MyView WITH(NOEXPAND) WHERE NotIndexedColumn = 5 ORDER BY NotIndexedColumn
So the first two questions are:
Is there any benefit to indexing the base tables of an indexed view?
What is SQL Server doing when it performs an index scan on the PK while the constraint is on a non-indexed column?
Then I noticed that if I use full-text search + order by I would see a table spool (eager spool) in the query plan with a cost like 95%.
Query looks like SELECT ID FROM View WITH(NOEXPAND) WHERE CONTAINS(IndexedColumn, '"SomeText*"') ORDER BY IndexedColumn
Question 3:
Is there any index I could add to get rid of that operation?
It's important to understand that an indexed view is a "materialized view": the results are stored on disk.
So the speedup you are seeing comes from reading the query's precomputed results straight off disk.
To answer your questions:
1) Is there any benefit to indexing the base tables of an indexed view?
This is situational. If your view flattens out data or adds many aggregate columns, then the indexed view is better than the table. If you are just using your indexed view like
SELECT * FROM foo WHERE createdDate > getDate()
then probably not. But if you are doing
SELECT sum(price), min(id) FROM x GROUP BY id, price
then the indexed view would probably be better. Granted, your actual query is more complex, with joins and other advanced options.
2) What is SQL Server doing when it performs an index scan on the PK while the constraint is on a non-indexed column?
First we need to understand how clustered indexes are stored. The index is stored as a B-tree, so when you search on a clustered index, SQL Server walks the tree to find all values that match your criteria. How your indexes are set up (covering vs. non-covering, and how your non-clustered indexes are defined) determines what the pages and extents look like. Without more knowledge of the table structure I can't tell you what the scan is actually doing.
3) Is there any index I could add to get rid of that operation?
Just because something takes 95% of the query's cost doesn't make it a bad thing. The costs in a plan always add up to 100%, so no matter what you do, something will always account for a large percentage. What you need to check is the IO reads and how much time the query itself takes.
To interpret that, keep in mind that SQL Server caches data pages in memory. A query can therefore take a long time the first time but run much quicker afterward, since the data it touches is already cached. It all depends on the frequency of the query and how your system is set up.
For a more in-depth read on indexed views:

Why does an indexed view materialize?

If we create an index on a view, we materialize the view.
Why is the view materialized when it is indexed? What is the significance, as opposed to a non-materialized view?
To my understanding, a normal view doesn't exist physically. Only its definition is stored, and each reference to the view executes the view definition all over again. So when we insert through a view, we insert directly into the table. Is that correct?
If the view is materialized, it becomes a physical table with its own data. In that case, would modifications to the base table no longer be reflected in this view (which has materialized and now lives its own life)?
Let's think about a table with a clustered index for a minute. When you choose your clustering key, SQL Server creates a B-tree, the leaves of which are the actual data. Non-clustered indexes work the same way, except the leaf nodes are tuples that represent your clustering key (so you can traverse the clustered index and get back to the actual data).
Extending the example, when you index a view, you first need to provide a clustered index. What would you expect to live at the leaves of that index? The data of course! :) And any non-clustered indexes on the view will behave exactly like their analogs on a physical table.
As to your question about a materialized view becoming stale: it doesn't. SQL Server knows that the view relies on the table (which is why the view needs to be schema bound, so you can't drop one of its constituent tables), and so any DML operations against the constituent tables are also reflected in the indexed view. You can convince yourself of this by creating an indexed view and then looking at the query plan of a simple update to one of the underlying tables. You should see a corresponding update to the indexed view.
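A quick, hypothetical way to see both effects (assuming an indexed view dbo.vOrders that is schema-bound to dbo.Orders):
-- Schema binding prevents pulling the table out from under the view:
DROP TABLE dbo.Orders;
-- fails, because the table is referenced by a schema-bound object

-- A plain update to the base table also maintains the view's index;
-- the actual execution plan shows the extra view-index maintenance:
UPDATE dbo.Orders SET Amount = Amount + 1 WHERE OrderID = 42;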
A view is simply a select statement that is saved and can be selected from, for convenience. Inserting/updating through a view does go directly to the table to perform its operation.
An indexed view is stored, indexed, just as a table.

What is the best way to implement soft deletion in a large relational Database? [duplicate]

Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance, each SELECT query will now need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECTs, just get rows WHERE deleted_at IS NULL.
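A minimal sketch, with hypothetical names:
ALTER TABLE dbo.Users ADD deleted_at datetime NULL;  -- NULL means "not deleted"

-- Soft delete:
UPDATE dbo.Users SET deleted_at = GETDATE() WHERE user_id = 42;

-- Normal reads exclude soft-deleted rows:
SELECT * FROM dbo.Users WHERE deleted_at IS NULL;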
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
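For instance (names hypothetical):
CREATE VIEW dbo.ActiveRecords AS
    SELECT id, name, created_at
    FROM dbo.Records
    WHERE is_deleted = '0';
-- Application code then selects from dbo.ActiveRecords instead of dbo.Records.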
Having an is_deleted column is a reasonably good approach.
If it is in Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan all 101000 rows.
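A sketch of such a list-partitioned table in Oracle (names hypothetical):
CREATE TABLE orders (
    order_id    NUMBER PRIMARY KEY,
    customer_id NUMBER NOT NULL,
    is_deleted  NUMBER(1) DEFAULT 0 NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_active  VALUES (0),
    PARTITION p_deleted VALUES (1)
);

-- Partition pruning: this touches only p_active.
SELECT * FROM orders WHERE is_deleted = 0;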
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which holds additional info like the time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
The second create results in a constraint violation, because login=JOE is already present in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
My own opinion is +1 for moving to a new table. It takes lots of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers).
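A sketch of technique 2 in SQL Server terms (table and column names hypothetical; a filtered unique index is a common variant of the same idea):
-- 2a: composite constraint; works because only one live row per login exists at a time
--     (each deleted row carries its own distinct deleted_at timestamp)
ALTER TABLE dbo.Users ADD CONSTRAINT UQ_Users_Login UNIQUE (login, deleted_at);

-- 2b: filtered unique index (SQL Server 2008+), enforcing uniqueness among live rows only
CREATE UNIQUE NONCLUSTERED INDEX UX_Users_Login_Live
    ON dbo.Users (login) WHERE deleted_at IS NULL;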
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly and hinder the use of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention what product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
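For example, in SQL Server (names hypothetical):
-- Filtered covering index over live rows only: it stays small, and any query
-- that also filters on is_deleted = 0 can use it.
CREATE NONCLUSTERED INDEX IX_Orders_Live
    ON dbo.Orders (customer_id)
    INCLUDE (order_date, total)
    WHERE is_deleted = 0;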
Something that I use on projects is a statusInd tinyint not null default 0 column.
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
Scales well and is data-centric, keeping the data footprint pretty small - key for 350GB+ DBs with realtime concerns. Using alternatives (tables, triggers) has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different states, e.g. published, private, deleted, needsApproval...
Create another schema and grant it all on your data schema.
Implement VPD on your new schema so that each and every query will have a predicate appended to it allowing selection of non-deleted rows only.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete

How to create materialized views in SQL Server?

I am going to design a Data Warehouse and I heard about materialized views. Actually I want to create a view that updates automatically when the base tables change. Can anyone explain with a query example?
They're called indexed views in SQL Server - read these white papers for more background:
Creating an Indexed View
Improving Performance with SQL Server 2008 Indexed Views
Basically, all you need to do is:
create a regular view
create a clustered index on that view
and you're done!
The tricky part is: the view has to satisfy quite a number of constraints and limitations - those are outlined in the white paper. If you do that, that's all there is. The view is updated automatically, with no maintenance needed.
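A minimal sketch of those two steps (table and column names hypothetical):
CREATE VIEW dbo.vOrderCustomer WITH SCHEMABINDING AS
    SELECT o.OrderID, o.Total, c.CustomerName
    FROM dbo.Orders o
    JOIN dbo.Customers c ON c.CustomerID = o.CustomerID;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vOrderCustomer ON dbo.vOrderCustomer (OrderID);
GO
-- On editions other than Enterprise, hint NOEXPAND so the optimizer uses the view's index:
SELECT * FROM dbo.vOrderCustomer WITH (NOEXPAND) WHERE OrderID = 42;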
Additional resources:
Creating and Optimizing Views in SQL Server
SQL Server Indexed Views
Although purely from an engineering perspective indexed views sound like something everybody could use to improve performance, the real-life scenario is very different. I have been unsuccessful in using indexed views where I most need them, because of too many restrictions on what can be indexed and what cannot.
If you have outer joins in the view, it cannot be indexed. Also, common table expressions are not allowed... In fact, if you have any ordering in subselects or derived tables (such as with a PARTITION BY clause), you are out of luck too.
That leaves only very simple scenarios for indexed views, which in my opinion can be optimized just as well by creating proper indexes on the underlying tables.
I would be thrilled to hear some real-life scenarios where people have actually used indexed views to their benefit and could not have done without them.
You might need a bit more background on what a Materialized View actually is. In Oracle it is a single object; when you try to build the equivalent elsewhere, it decomposes into a number of elements.
An MVIEW is essentially a snapshot of data from another source. Unlike a view, the data is not fetched from the source when you query it; it is stored locally in a form of table. The MVIEW is refreshed by a background procedure that kicks off at regular intervals or when the source data changes. Oracle allows for full or partial refreshes.
In SQL Server, I would use the following to create a basic MVIEW that is (completely) refreshed regularly.
First, a view. This should be easy for most since views are quite common in any database
Next, a table. This should be identical to the view in columns and data. This will store a snapshot of the view data.
Then, a procedure that truncates the table, and reloads it based on the current data in the view.
Finally, a job that triggers the procedure to start its work.
Everything else is experimentation.
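A sketch of that pattern (all names hypothetical):
-- 1. The view dbo.vReport is assumed to exist already.
-- 2. A snapshot table matching the view's columns:
CREATE TABLE dbo.ReportSnapshot (ReportDate date NOT NULL, Total money NOT NULL);
GO
-- 3. A procedure performing a complete refresh:
CREATE PROCEDURE dbo.RefreshReportSnapshot
AS
BEGIN
    BEGIN TRANSACTION;
    TRUNCATE TABLE dbo.ReportSnapshot;
    INSERT INTO dbo.ReportSnapshot (ReportDate, Total)
    SELECT ReportDate, Total FROM dbo.vReport;
    COMMIT TRANSACTION;
END;
GO
-- 4. Schedule dbo.RefreshReportSnapshot with a SQL Server Agent job.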
When an indexed view is not an option, and quick updates are not necessary, you can create a hack cache table:
select * into cachetablename from myviewname
alter table cachetablename add primary key (columns)
-- OR alter table cachetablename add rid bigint identity primary key
create index...
then sp_rename the view/table, or change any queries or other views that reference it to point to the cache table.
Schedule a daily/nightly/weekly/whatnot refresh like:
begin transaction
truncate table cachetablename
insert into cachetablename select * from viewname
commit transaction
NB: this will eat space, also in your tx logs. Best used for small datasets that are slow to compute. Maybe refactor to eliminate "easy but large" columns first into an outer view.
For MS T-SQL Server, I suggest looking into creating an index with the "include" statement. Uniqueness is not required, and neither is the physical sorting of data associated with a clustered index. "INDEX ... INCLUDE ()" creates separate physical data storage automatically maintained by the system. It is conceptually very similar to an Oracle Materialized View.
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/create-indexes-with-included-columns
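A sketch of such an index (names hypothetical):
-- Key columns support seeks and ordering; INCLUDE columns are stored at the leaf
-- level so matching queries never touch the base table (a "covering" index).
CREATE NONCLUSTERED INDEX IX_Orders_Customer
    ON dbo.Orders (CustomerID, OrderDate)
    INCLUDE (Total, Status);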
