I have a set of approx 1 million rows (approx rowsize: 1.5kb) that needs to be "cached" so that many different parts of our application can utilize it.
These rows are a derived/denormalized "view" of compiled data from other tables. Generating this data isn't terribly expensive (30-60sec) but is far too slow to generate "on the fly" as part of a view or table-valued function that the application can query directly. I want to update this data periodically, perhaps every few minutes.
My first thought is to have a scheduled job that updates a global temp table with this data every n minutes.
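Roughly the shape I'm picturing for the job body (all names here are placeholders for the real compilation query):

-- ##CompiledRows is the global temp cache; dbo.SourceA/dbo.SourceB stand in
-- for the real source tables behind the 30-60 second compilation query.
TRUNCATE TABLE ##CompiledRows   -- minimally logged, unlike DELETE
INSERT INTO ##CompiledRows (KeyCol, Payload)
SELECT a.KeyCol, b.Payload
FROM dbo.SourceA AS a
JOIN dbo.SourceB AS b
    ON b.KeyCol = a.KeyCol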
What's the best strategy, performance-wise? I'm not sure of the performance implications of storing it in a real table versus a global temp table (##tablename) versus other strategies I haven't thought of. I don't want to muck up the transaction logs with inserts to this table... it's all derived data and doesn't need to be persisted.
I'm using Microsoft SQL Server 2000. Upgrading during the timeframe of this project isn't an option, but if there's functionality in 2005/2008 or later that would make this easier, I'd appreciate hearing about that.
I'd recommend using a materialized view (AKA indexed view); there's a minimal sketch after the limitations list below.
Limitations:
View definition must always return the same results from the same underlying data.
Views cannot use non-deterministic functions.
The first index on a View must be a clustered, UNIQUE index.
If you use Group By, you must include the new COUNT_BIG(*) in the select list.
View definition cannot contain the following:
TOP
Text, ntext or image columns
DISTINCT
MIN, MAX, COUNT, STDEV, VARIANCE, AVG
SUM on a nullable expression
A derived table
Rowset function
Another view
UNION
Subqueries, outer joins, self joins
Full-text predicates like CONTAINS or FREETEXT
COMPUTE or COMPUTE BY
The view definition cannot include ORDER BY.
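For illustration, a minimal indexed-view sketch that stays inside those limitations (the table and column names are hypothetical); note the COUNT_BIG(*) and the ISNULL guard that keeps SUM off a nullable expression:

-- Hypothetical source table: dbo.Orders(CustomerID INT, Amount MONEY NULL).
CREATE VIEW dbo.vw_OrderTotals
WITH SCHEMABINDING
AS
SELECT  o.CustomerID,
        COUNT_BIG(*)             AS OrderCount,   -- required with GROUP BY
        SUM(ISNULL(o.Amount, 0)) AS TotalAmount   -- SUM must be non-nullable
FROM    dbo.Orders AS o
GROUP BY o.CustomerID;
GO

-- The first index must be UNIQUE and CLUSTERED; this is what
-- materializes the view's result set.
CREATE UNIQUE CLUSTERED INDEX IX_vw_OrderTotals
    ON dbo.vw_OrderTotals (CustomerID);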
Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still change from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some sort of lookup table, from which we get several possibilities for each row, and then select the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (updating a couple of million rows within a couple of billion rows table)
Should I write an UPDATE statement with a join?
Would it be better to DELETE this million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance for Snowflake vs. a traditional RDBMS behaves quite differently. All your tables persist in S3, and S3 does not let you rewrite only select bytes of an existing object; the entire file object must be uploaded and replaced. So, while in, say, SQL Server, data and indexes are modified in place, creating new pages as necessary, an UPDATE/DELETE in Snowflake is a full sequential scan of the table file, creating an immutable copy of the original with the applicable rows filtered out (DELETE) or modified (UPDATE), which then replaces the file just scanned.
So, whether you update 1 row or 1M rows, at minimum the entirety of the micro-partitions in which the modified data lives will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B). Among other things, it should keep your Time Travel costs down vs. constantly wiping and rewriting tables. Another consideration: since Snowflake is column-oriented, a column UPDATE should in theory only require operations on the S3 files for that column, whereas an INSERT/DELETE would replace all S3 files for all columns, which would lower performance.
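For the subset-reassessment scenario in the question, a minimal MERGE sketch might look like this (table and column names are hypothetical; staged_updates would hold the recomputed property values):

-- Apply the differential from staged_updates into extended_fact.
MERGE INTO extended_fact AS tgt
USING staged_updates AS src
    ON tgt.fact_id = src.fact_id
WHEN MATCHED THEN
    UPDATE SET derived_prop = src.derived_prop
WHEN NOT MATCHED THEN
    INSERT (fact_id, derived_prop)
    VALUES (src.fact_id, src.derived_prop);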
I use SSMS 2016. I have a view that has a few million records. The view is not indexed, and should not be, as it's being updated (insert, delete, update) every 5 minutes by a job on the server in order to then serve updated data sets to the calling client application's GUI.
The view does a very heavy volume of INT-to-VARCHAR conversions, appending string values to the results.
The view also does some CAST operations on NULLs, assigning them column aliases. And the worst performance hit is that the view uses FOR XML PATH('') on 20 columns.
The view also uses two CTEs as the source, as well as subqueries to define a single column value.
I made sure I created the right indexes (clustered, nonclustered, composite, and covering) for the columns used in the view's SELECT, JOIN, and WHERE clauses.
The Database Tuning Advisor also has not suggested anything that could substantially improve performance.
As a workaround, I decided to create two identical physical tables with a clustered index on each, and to keep them updated with a MERGE statement (further wrapped in a stored procedure and then scheduled as a SQL Server Agent job). To ensure there is no long locking of the view, I will then swap (rename) the table names immediately after each merge finishes. So in this design all the heavy workload falls onto the SQL Server Agent job that keeps the tables updated.
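Roughly, the swap step would look like this (table names are placeholders; sp_rename's warnings about breaking scripts are expected here):

-- dbo.ReportData_Staging has just been brought up to date by the MERGE job;
-- swap it with the live dbo.ReportData in one short transaction.
BEGIN TRAN
    EXEC sp_rename 'dbo.ReportData',         'ReportData_Old'
    EXEC sp_rename 'dbo.ReportData_Staging', 'ReportData'
    EXEC sp_rename 'dbo.ReportData_Old',     'ReportData_Staging'
COMMIT TRAN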
The problem is that the merge takes roughly 15 minutes at the current data size, which may grow in the future. So I need a near-real-time design to ensure the view always serves the most up-to-date information.
Any ideas?
I am creating a Java function that needs to use a SQL query with a lot of joins before doing a full scan of its result. Instead of hard-coding a lot of joins I decided to create a view with this complex query. Then the Java function just uses the following query to get this result:
SELECT * FROM VW_####
So the program is working fine, but I want to make it faster, since this SELECT command is taking a lot of time. After taking a look at its execution plan, I created some indexes and made it roughly 30% faster, but I want to do better.
The problem is that every operation in the execution plan has a cost between 0% and 4% except one: a clustered-index insert that carries roughly 50% of the execution cost. I think the system is using a temporary table to store the view's data, but an index on this view isn't useful for me because I need all rows from it.
So what can I do to optimize that insert into CWT_PrimaryKey? I don't think I can turn that index off, since it seems to be part of SQL Server's internals. I read somewhere that this operation can appear when you use cursors, but I don't think I am using one (or does the view use one?).
The command to create the view is something simple (no T-SQL, no OPTION, etc) like:
create view VW_#### as SELECTS AND JOINS HERE
And here is a picture of the problematic part from the execution plan: http://imgur.com/PO0ZnBU
EDIT: More details:
Well, the query to create the problematic view is a big query that joins a lot of tables. Based on a single parameter, the Java client modifies the query string before creating it. This view represents a "data unit" from a legacy database migrated to SQL Server that didn't have any foreign or primary keys, so our team chose to follow this strategy. Because of that, the view has more than 50 columns and is made from the join of seven other views.
Main view's query (with a lot of Portuguese words): http://pastebin.com/Jh5vQxzA
The other views (VW_Sintese1 through VW_Sintese7) are created like this one but without using extra views; they just use joins with the tables that contain the data requested by the main view.
Then the Java client creates a PreparedStatement with the query "Select * from VW_Sintese####" and executes it using executeQuery, something like:
// Scroll-insensitive, read-only result set: the driver asks SQL Server for a
// static snapshot cursor, which can materialize the whole result in a tempdb
// worktable (the CWT seen in the plan).
String query = "Select * from VW_Sintese####";
PreparedStatement ps = myConn.prepareStatement(query,
        ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = ps.executeQuery();
And then the program goes on until the end.
Thanks for the attention.
First: you should post the code of the view, along with whatever is using the view, because the rest of this answer depends on it.
Second: the definition of a view in SQL Server is later substituted into the queries that use it. In other words, you created a view, but since (I'm assuming) it isn't an indexed view, it is the same as writing the original, long SELECT statement. SQL Server essentially just swaps it in within the DML statement.
From Microsoft's 'Querying Microsoft SQL Server 2012': T-SQL supports the following table expressions: derived tables, common table expressions (CTEs), views, inline table-valued functions.
And a direct quote:
It’s important to note that, from a performance standpoint, when SQL Server optimizes queries involving table expressions, it first unnests the table expression’s logic, and therefore interacts with the underlying tables directly. It does not somehow persist the table expression’s result in an internal work table and then interact with that work table. This means that table expressions don’t have a performance side to them—neither good nor bad—just no side.
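To make that concrete, a tiny sketch with hypothetical objects:

CREATE TABLE dbo.T (a int, b int);
GO
CREATE VIEW dbo.vw_X AS SELECT a, b FROM dbo.T WHERE b > 0;
GO
-- This query...
SELECT a FROM dbo.vw_X WHERE a = 1;
-- ...is optimized exactly as if you had written:
SELECT a FROM dbo.T WHERE b > 0 AND a = 1;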
This is a long way of reinforcing the first statement: please include the SQL code in the view and what you're actually using as the SELECT statement. Otherwise, we can't help much :) Cheers!
Edit: Okay, so you've created a view (no performance gain there) that does 4-5 LEFT JOINs onto the main view (again, you're not helping yourself much here by eliminating rows, etc.). If there are search arguments you can use to filter the result set down to fewer rows, you should have those in here. And lastly, you're ordering all of this at the top, so the query engine has to fetch those views, join them up into a massive SELECT, figure out the correct order, and (I'm guessing here) the result count is HUGE and SQL Server's engine is ordering it in some kind of temporary table.
The short answer: get less data (fewer columns and only the rows you need); don't order the results if the result set is very large; just get the data to the client and then sort it there.
Again, if you want more help, you'll need to post table schemas and index strategies for all tables that are in the query (including the views that are joined) and you'll need to include all view definitions (including the views that are joined).
Can someone explain to me what views or materialized views are in plain everyday English please? I've been reading about materialized views but I don't understand.
Sure.
A normal view is a query that defines a virtual table -- you don't actually have the data sitting in the table; you create it on the fly by executing the view's query.
A materialized view is a view where the query gets run and the data gets saved in an actual table.
The data in the materialized view gets refreshed when you tell it to.
A couple use cases:
We have multiple Oracle instances where we want to have the master data on one instance, and a reasonably current copy of the data on the other instances. We don't want to assume that the database links between them will always be up and operating. So we set up materialized views on the other instances, with queries like select a, b, c from mytable@master, and tell them to refresh daily.
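In Oracle, that setup might look roughly like this (the materialized view name and the db link name master are illustrative):

CREATE MATERIALIZED VIEW mv_mytable
    REFRESH COMPLETE
    START WITH SYSDATE NEXT SYSDATE + 1   -- refresh daily
AS
SELECT a, b, c FROM mytable@master;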
Materialized views are also useful in query rewrite. Let's say you have a fact table in a data warehouse with every book ever borrowed from a library, with dates and borrowers. And that staff regularly want to know how many times a book has been borrowed. Then build a materialized view as select book_id, book_name, count(*) as borrowings from book_trans group by book_id, book_name, set it for whatever update frequency you want -- usually the update frequency for the warehouse itself. Now if somebody runs a query like that for a particular book against the book_trans table, the query rewrite capability in Oracle will be smart enough to look at the materialized view rather than walking through the millions of rows in book_trans.
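A sketch of that warehouse example, with query rewrite enabled so the optimizer can substitute it for matching queries against book_trans:

CREATE MATERIALIZED VIEW mv_borrowings
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    ENABLE QUERY REWRITE
AS
SELECT book_id, book_name, COUNT(*) AS borrowings
FROM   book_trans
GROUP  BY book_id, book_name;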
Usually, you're building materialized views for performance and stability reasons -- flaky networks, or doing long queries off hours.
A view is basically a "named" SQL statement. You can reference views in your queries much like a real table. When accessing the view, the query behind the view is executed.
For example:
create view my_counter_view(num_rows) as
select count(*)
from gazillion_row_table;
select num_rows from my_counter_view;
Views can be used for many purposes, such as providing a simpler data model, implementing security constraints, reusing SQL queries, and working around SQL shortcomings.
A materialized view is a view where the query has been executed and the results have been stored as a physical table. You can reference a materialized view in your code much like a real table. In fact, it is a real table that you can index, declare constraints on, etc.
When accessing a materialized view, you are accessing the pre-computed results. You are NOT executing the underlying query. There are several strategies for keeping the materialized view up to date. You will find them all in the documentation.
Materialized views are rarely referenced directly in queries. The point is to let the optimizer use "query rewrite" mechanics to internally rewrite a query, such as the COUNT(*) example above, into a query on the precomputed table. This is extremely powerful, as you don't need to change the original code.
There are many uses for materialized views, but they are mostly used for performance reasons. Other uses are: replication, complicated constraint checking, and workarounds for deficiencies in the optimizer.
Long version: -> Oracle documentation
A view is a query on one or more tables. A view can be used just like a table, to select from or to join with other tables or views. A materialized view is a view that has been fully evaluated and whose rows have been stored in memory or on disk. Therefore, each time you select from a materialized view, there is no need to perform the query that produces the view, and the results are returned instantly.
For example, a view may be a query such as SELECT account, SUM(payment) FROM payments GROUP BY account with a large number of payments in the table but not many accounts. Each time this view is used the whole table must be read. With a materialized view, the result is returned instantly.
The non-trivial issue with materialized views is to update them when the underlying data is changed. In this example, each time a new row is added to the payments table, the row in the materialized view that represents the account needs to be updated. These updates may happen synchronously or periodically.
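As a hedged Oracle sketch of the synchronous flavor: fast refresh on commit needs a materialized view log on the base table, and aggregate materialized views also need supporting COUNT columns (the view name here is hypothetical):

CREATE MATERIALIZED VIEW LOG ON payments
    WITH SEQUENCE, ROWID (account, payment) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW mv_account_totals
    REFRESH FAST ON COMMIT
AS
SELECT account,
       SUM(payment)   AS total_payment,
       COUNT(payment) AS payment_cnt,   -- COUNT columns are required
       COUNT(*)       AS row_cnt        -- for fast refresh of SUM
FROM   payments
GROUP  BY account;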
Yes. Materialized views are views with a base table underneath them. You define the view and Oracle creates the base table underneath it automatically.
By executing the view and placing the resulting data in the base table you gain performance.
They are useful for a variety of reasons. Some examples of why you would use a materialized view are:
1) A view that is complex may take a long time to execute when referenced
2) A view included in complex SQL may yield poor execution plans leading to performance issues
3) You might need to reference data across a slow DBLINK
A materialized view can be set up to refresh periodically.
You can specify a full or partial refresh.
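For the periodic case, an on-demand refresh can be kicked off in Oracle like this (the materialized view name is hypothetical; 'C' = complete, 'F' = fast):

EXEC DBMS_MVIEW.REFRESH('MY_MV', 'C');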
Please see the Oracle documentation for complete information
SQL Server doesn't allow creating a view with schema binding where the view query uses OPENQUERY, as illustrated below.
Is there a way or a work-around to create an index on such a view?
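For illustration, the failing pattern looks something like this (reconstructed; the ADSI linked server and LDAP path are placeholders):

-- Fails: schema binding requires two-part-named local objects,
-- and OPENQUERY against a linked server isn't one.
CREATE VIEW dbo.vw_AdUsers
WITH SCHEMABINDING
AS
SELECT sAMAccountName, displayName
FROM OPENQUERY(ADSI,
    'SELECT sAMAccountName, displayName
       FROM ''LDAP://DC=example,DC=com''
      WHERE objectClass = ''user''');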
The best you could do would be to schedule a periodic export of the AD data you are interested in to a table.
The table could of course then have all the indexes you like. If you ran the export every 10 minutes and the possibility of getting data that is 9 minutes and 59 seconds out of date is not a problem, then your queries will be lightning fast.
The only part of concern would be managing locking and concurrency during the export time. One strategy might be to export the data into a new table and then through renames swap it into place. Another might be to use SYNONYMs (SQL 2005 and up) to do something similar where you just point the SYNONYM to two alternating tables.
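A minimal sketch of the synonym variant (the synonym and table names are hypothetical); queries always go through the synonym, and the repoint is near-instant:

-- Load the fresh AD export into the idle table (here dbo.AdUsers_B),
-- then repoint the synonym in a short transaction.
BEGIN TRAN
    DROP SYNONYM dbo.AdUsers
    CREATE SYNONYM dbo.AdUsers FOR dbo.AdUsers_B
COMMIT TRAN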
The data that supplies the query you're performing comes from a completely different system outside of SQL Server. There's no way that SQL Server can create an indexed view on data it does not own. For starters, how would it be notified when something had changed so it could update its indexes? There would have to be some notification and update mechanism, which is implausible because SQL Server could not reasonably maintain ACID across such a distributed, slow, non-SQL Server transaction to an outside system.
Thus my suggestion for mimicking such a thing through your own scheduled jobs that refresh the data every X minutes.
--Responding to your comment--
You can't tell whether a new user has been added without querying. If Active Directory supports some API that generates events, I've never heard of it.
But, each time you query, you could store the greatest creation time of all the users in a table, then through dynamic SQL, query only for new users with a creation date after that. This query should theoretically be very fast as it would pull very little data across the wire. You would just have to look into what the exact AD field would be for the creation date of the user and the syntax for conditions on that field.
If managing the dynamic SQL is too tough, a very simple VBScript, VB, or .NET application could also query Active Directory for you on a schedule and update the database.
Here are the basics of indexed views and their requirements. Note that what you are trying to do would probably fall into the category of a derived table; therefore it is not possible to create an indexed view using OPENQUERY.
This list is from http://www.sqlteam.com/article/indexed-views-in-sql-server-2000
1. View definition must always return the same results from the same underlying data.
2. Views cannot use non-deterministic functions.
3. The first index on a View must be a clustered, UNIQUE index.
4. If you use Group By, you must include the new COUNT_BIG(*) in the select list.
5. View definition cannot contain the following:
   a. TOP
   b. Text, ntext or image columns
   c. DISTINCT
   d. MIN, MAX, COUNT, STDEV, VARIANCE, AVG
   e. SUM on a nullable expression
   f. A derived table
   g. Rowset function
   h. Another view
   i. UNION
   j. Subqueries, outer joins, self joins
   k. Full-text predicates like CONTAINS or FREETEXT
   l. COMPUTE or COMPUTE BY
   m. ORDER BY
In this case, there is no way for SQL Server to know of any changes (data, schema, whatever) in the remote data source. For a local table, it can use SCHEMABINDING etc. to ensure the underlying table(s) stay the same, and it can track data changes.
If you need to query the view often, then I'd use a local table that is refreshed periodically. In fact, I'd use a table anyway. AD queries aren't the quickest at the best of times...