I have a requirement to create a report that is killing the processor and taking a long time to run.
I think I could speed this up significantly by creating an indexed view that keeps all this data in one place, making it a lot easier to query and report on. This view would not just be used for the report, as I think it would benefit quite a few areas in the data layer.
The indexed view will potentially contain 5 million+ records, and I can't seem to find any guidance as to at what point indexed views are no longer recommended. I assume that an indexed view of this size would take considerable time to build when SQL Server first starts, but I would hope that after this the cost of maintaining it would be minimal.
Is there any kind of best practice guide as to when to use indexed views and when not to use them? Would the view rebuild itself after every server restart, or does it get stored somewhere on disk?
The index associated with your indexed view will be updated whenever updates are made to any of the columns in the index.
A high volume of updates will most likely kill the benefit. If the workload is mainly reads, then it will work fine.
The real benefits of Indexed Views are when you have aggregates that are too expensive to compute in real time.
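A minimal T-SQL sketch of what such a view looks like (table, view and column names are hypothetical, and Amount is assumed to be NOT NULL, which indexed views require for SUM):

CREATE VIEW dbo.vOrderTotals
WITH SCHEMABINDING   -- required for indexed views
AS
SELECT CustomerID,
       COUNT_BIG(*) AS OrderCount,   -- COUNT_BIG(*) is required when the view uses GROUP BY
       SUM(Amount)  AS TotalAmount
FROM dbo.Orders
GROUP BY CustomerID;
GO

-- The unique clustered index is what materializes the view and persists it on disk:
CREATE UNIQUE CLUSTERED INDEX IX_vOrderTotals ON dbo.vOrderTotals (CustomerID);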
Please see: Improving Performance with SQL Server 2008 Indexed Views:
Indexed views can increase query performance in the following ways:

Aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.
Tables can be prejoined and the resulting data set stored.
Combinations of joins or aggregations can be stored.

The query optimizer considers indexed views only for queries with nontrivial cost. This avoids situations where trying to match various indexed views during the query optimization costs more than the savings achieved by the indexed view usage. Indexed views are rarely used in queries with a cost of less than 1.

Applications that benefit from the implementation of indexed views include:

Decision support workloads.
Data marts.
Data warehouses.
Online analytical processing (OLAP) stores and sources.
Data mining workloads.

From the query type and pattern point of view, the benefiting applications can be characterized as those containing:

Joins and aggregations of large tables.
Repeated patterns of queries.
Repeated aggregations on the same or overlapping sets of columns.
Repeated joins of the same tables on the same keys.
Combinations of the above.
An indexed view (aka materialized view) is maintained by SQL Server after every change to the underlying table(s). Needless to say, you should not have an indexed view on a table with heavy write traffic.
For your problem, a better solution would be to run the query and store it in its own table, like:
select * into CachedReport from YourView
That will give you the performance of an indexed view, while you can decide when to refresh it. For example, you could refresh it by running the select into query from a scheduled job every night.
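A minimal sketch of such a nightly refresh, reusing the CachedReport/YourView names from above (schedule it with SQL Server Agent or whatever job scheduler you use; the optional index name and column are illustrative):

IF OBJECT_ID('dbo.CachedReport', 'U') IS NOT NULL
    DROP TABLE dbo.CachedReport;

SELECT * INTO dbo.CachedReport FROM dbo.YourView;

-- Optionally index the cache for the report's lookup columns:
-- CREATE INDEX IX_CachedReport_Lookup ON dbo.CachedReport (SomeLookupColumn);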
I'm not aware of any guidance concerning size of indexed views. It's effectively a separate table that's being "automagically" updated every time the base tables on which it depends are updated, so I tend to think of it as a separate table.
As to your question on the building of the index - it's stored on disk, the same as every other index, so it doesn't get rebuilt during server restart (other than any repair that takes place due to transactions not having completed before the restart).
There's no hard row number limit on when to use a table or a materialised view.
However, as a guideline, avoid a materialised view over volatile tables - the maintenance load can kill your server.
First off, as Timothy suggested, check the indexes on your underlying tables, then the statistics. Your query optimiser might simply be off track due to missing or out-of-date statistics.
If this doesn't help with performance, check what data is really required from the view; my guess is that (a) the row count and (b) the row size are what is killing your server, as loading the whole view into a temp table drags it through I/O contention.
Related
I have tables that are historicized and then views are created from them to retain only the most recent and active data.
I wanted to make views that would aggregate some of these views together, where I would create my view as a SELECT * FROM {Other view(s)}. So a bit like this:
Table -> Intermediate View -> Aggregated View
I'm just wondering if I'll run into any performance hit by basing my view on other views. Should I just instead have my aggregated views be more complex code-wise, but based directly on the underlying tables?
Table -> Aggregated View
Or does it not make a difference at all?
Thanks a lot.
From a performance viewpoint, it doesn't make any difference - unless you are making your view out of a single table, in which case you would be able to Materialize your view - in fact, one of the biggest limitations of Materialized Views is that the FROM has to refer to a single table.
From a software engineering viewpoint, I see many advantages, like more reusable work and more flexible and, potentially, faster development (while developer-A works on View-A, developer-B works on View-B, and developer-C could even work on View-C to combine View-A and View-B).
The downside is the increase in complexity of the lineage of the views which might require a graphical representation in some cases where objects are too many.
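As a rough sketch of the layering being discussed (all object names are invented for illustration):

-- Intermediate view: keep only the most recent, active rows from the historicized table.
CREATE VIEW current_customers AS
SELECT *
FROM customers_history
WHERE is_active = TRUE;

-- Aggregated view built on top of the intermediate view rather than the base table.
CREATE VIEW customer_summary AS
SELECT country, COUNT(*) AS customer_count
FROM current_customers
GROUP BY country;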
I have found myself doing this more and more in Snowflake, to the point where I'm writing a blog about it and giving it a new acronym, ELVT. I've built a 3-layer stacking of VIEWs at one client. The lowest level is a simple VIEW against a single table, with presentation names for each column. The next layer is business logic for the underlying single-table VIEW. The 3rd level joins VIEWs and adds more complex business logic (lots of UDFs).
I have a meta-data repository from which I generate all of the VIEWs (which also provides lineages).
The final VIEWs have 35+ joins against 40+ physical tables. Salesforce, Marketo, Eloqua and others.
A SELECT * against multiple years of data using a medium warehouse averaged 1 min 25 s.
These VIEWs replaced thousands of lines of QLIK scripting with a simple SELECT FROM VIEW.
One point to note is that if you are comparing one really large block of SQL to nested views (aka macros), then they will perform the same.
The downside to nested views is that you are selecting a lot of columns (in the SQL that gets compiled), so if at the top level you are not using most of the columns, your SQL compile times will be marginally slower.
Also, sometimes if you put a filter for, say, a date range over a large volume of SQL, the optimizer can fail to push the filters down, and you can then pull/compute large amounts of data that are later thrown away.
We found this happened, and the optimizer behavior can change with releases - sometimes for the better, sometimes for much worse.
We ended up using table functions for a number of parts of the SQL to force the date range into the lower-layer "views". But we also controlled the layer writing the dynamic SQL, so this was an easy substitution.
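A rough Snowflake sketch of that table-function substitution (function, view and column names are invented); because the date range arrives as arguments, the filter is guaranteed to be applied inside the lower layer rather than relying on the optimizer to push it down:

CREATE OR REPLACE FUNCTION orders_in_range(start_date DATE, end_date DATE)
  RETURNS TABLE (order_id NUMBER, order_date DATE, amount NUMBER)
AS
$$
    SELECT order_id, order_date, amount
    FROM orders_base_view
    WHERE order_date BETWEEN start_date AND end_date
$$;

-- Callers (or the generated dynamic SQL) then select from the function instead of the view:
SELECT * FROM TABLE(orders_in_range('2024-01-01'::DATE, '2024-03-31'::DATE));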
It depends upon what type of processing you are doing in the view: if it is a lot, then you can create a materialized view (this requires storage, and hence will incur some cost).
As a first option, try creating a plain view, and if that does not help, then try an MV.
In the Snowflake documentation, I could not find a reference to using Indexes.
Does Snowflake support Indexes and, if not, what is the alternative approach to performance tuning when using Snowflake?
Snowflake does not use indexes. This is one of the things that makes Snowflake scale so well for arbitrary queries. Instead, Snowflake calculates statistics about columns and records in files that you load, and uses those statistics to figure out what parts of what tables/records to actually load to execute a query. It also uses a columnar store file format, that lets it only read the parts of the table that contain the fields (columns) you actually use, and thus cut down on I/O on columns that you don't use in the query.
Snowflake slices big tables (gigabyte, terabyte or larger) into smaller "micro-partitions." For each micro-partition, it collects statistics about what value ranges each column contains. Then, it only loads micro-partitions that contain values in the range needed by your query. As an example, let's say you have a column of timestamps. If your query asks for data between June 1 and July 1, then micro-partitions that do not contain any data in this range will not be loaded or processed, based on the statistics stored for dates in the micro-partition files.
Indexes are often used for online transaction processing, because they accelerate workflows when you work with one or a few records, but when you run analytics queries on large datasets, you almost always work with large subsets of each table in your joins and aggregates. The storage mechanism, with automatic statistics, automatically accelerates such large queries, with no need for you to specify an index, or tune any kind of parameters.
Snowflake does not support indexes, though it does support "clustering" for performance improvements of I/O.
I recommend reading these links to get familiar with this:
https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html
https://docs.snowflake.net/manuals/user-guide/tables-auto-reclustering.html
Here's a really good blog post on the topic as well:
https://www.snowflake.com/blog/automatic-query-optimization-no-tuning/
Hope this helps...Rich
No, Snowflake does not have indexes. Its performance boosts come from eliminating unnecessary scanning, which it achieves by maintaining rich metadata in each of its micro-partitions. For instance, if you have a time filter in your query and your table is more or less sorted by time, then Snowflake can "prune" away the parts of the table that are not relevant to the query.
Having said this, Snowflake is constantly releasing new features, and one such feature is its Search Optimisation Service, which allows you to perform "needle in a haystack" queries on selected columns that you enable. Not quite indexes that you can create, but something like that being used behind the scenes, perhaps.
No, Snowflake doesn't support indexes. And don't let them tell you that this is an advantage.
Performance tuning can be done as described above, but often it is done with money: pay for bigger warehouses.
Snowflake doesn't support indexes; it keeps data in micro-partitions - in other words, it breaks data sets into small files, converts rows into a columnar format and compresses them. Snowflake's metadata manager in the services layer holds all the information about each micro-partition, such as which partition contains which data.
Each partition holds information about itself in its header, such as max value, min value, cardinality etc. This is much better than the indexes of conventional databases.
Snowflake is a columnar database with automatic micro-partitioning. Note that in SQL Server, Microsoft calls its columnar storage option a columnstore index.
The performance gain from columnar storage on data warehouse/mart type queries is spectacular compared with its row-store brethren. By storing data by column, the columns can be greatly compressed, allowing a huge amount of data to be held in memory.
If your predominant queries are on a naturally ordered column, such as OrderDate then it makes sense to cluster on OrderDate. You will gain a performance benefit from doing that.
Clustering isn't a catch-all performance boost. Choose your clustering unwisely and you can degrade performance for your queries.
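For example, a hedged sketch of clustering an assumed orders table on order_date, and then checking how well clustered it actually is:

ALTER TABLE orders CLUSTER BY (order_date);

-- Reports clustering depth/overlap statistics for the chosen key:
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');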
In terms of performance tuning there are techniques you can use.
When using a dimensional model look at the most commonly used aspects of those dimensions and look to denormalise those aspects into your fact tables to reduce the number of joins.
For example, if the queries use Week, Month and Quarter, then denormalise those aspects into the fact table that is giving you performance concerns. The effect on storage in a column-store DB is far less than in a row-store DB, so the cost/benefit balance is much better.
Materialised views are another performance tuning option; however, these come with caveats:
The range of SQL statements available to you for materialised views is far less than for other views
Not all aggregates are supported
Can only be on a single table
They work well when data doesn't change often.
If your underlying table is clustered on OrderDate then a materialised view of last months orders might not give you the desired performance benefit because partition pruning might already be doing what is needed.
If your query performance is a result of contention with other users, then spinning up another warehouse might be the answer. Two warehouses dedicated to their tasks might be more cost-effective than scaling up a single warehouse.
Primary/unique key constraints can be defined but are metadata only despite the constraint documentation describing the enforced/not enforced syntax.
Some distributed column stores do support PK and FK constraints, Vertica being an example, but most do not because the performance impact of enforcing them is too high.
** Updated Fall 2022 - thanks to Hobo's comment: Yes, via Unistore's Hybrid Tables. **
Original Response:
Neither Snowflake nor any high-performance big data / OLAP system will support [unique] indexes because these systems are MPP (Massively Parallel Processing). MPP systems load data with thousands of concurrent inserts into the same table. [Unique] Indexes are a concept from much smaller / OLTP systems. Even then many data engineers intentionally disable the [unique] indexes on OLTP systems when they approach big data scale especially as the data is inserted or frequently updated and deleted.
If you want a "non-unique index" then you can use a slew of features such as: micro-partitions, clustered tables, auto-clustering, Search Optimization Service, etc.
This Medium article can give you some workarounds: How can we enforce [Unique, Primary Key, Foreign Key (UPF)] column constraints in Snowflake?
Snowflake does not support indexing natively, but it has other ways to tune performance:
Reduce queuing by setting a time-out and/or adjusting the max concurrency
Use result caching
Tackle disk spilling
Rectify row expansion by using the distinct clause, using temporary tables and checking your join order
Fix inadequate pruning by setting up data clustering
Reference: https://rockset.com/blog/what-do-i-do-when-my-snowflake-query-is-slow-part-2-solutions/ (Disclosure: I work for Rockset).
In short, Snowflake does not support indexes, but it does support a single clustering key on each table.
Snowflake does not support indexes, but if you are looking for optimization you can use Snowflake's Search Optimization Service.
Please refer to the Snowflake documentation below.
https://docs.snowflake.com/en/user-guide/search-optimization-service.html
Snowflake's Search Optimization Service will create indexes over all the pertinent columns in a table "out of the box", as well as provide other advanced search features (e.g. substring and regex matching).
If you'd like to optimize for specific expressions used in your queries, you can customize SOS as well.
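For instance, a hedged sketch against an assumed sales table (the column names are illustrative):

-- Enable search optimization for the whole table:
ALTER TABLE sales ADD SEARCH OPTIMIZATION;

-- Or target specific access patterns on specific columns:
ALTER TABLE sales ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id), SUBSTRING(product_name);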
How can you determine if the performance gained on a SELECT by indexing a column will outweigh the performance loss on an INSERT in the same table? Is there a "tipping-point" in the size of the table when the index does more harm than good?
I have table in SQL Server 2008 with 2-3 million rows at any given time. Every time an insert is done on the table, a lookup is also done on the same table using two of its columns. I'm trying to determine if it would be beneficial to add indexes to the two columns used in the lookup.
Like everything else SQL-related, it depends:
What kind of fields are they? Varchar? Int? Datetime?
Are there other indexes on the table?
Will you need to include additional fields?
What's the clustered index?
How many rows are inserted/deleted in a transaction?
The only real way to know is to benchmark it. Put the index(es) in place and do frequent monitoring, or run a trace.
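One way to do that monitoring in SQL Server is the index usage DMV, sketched below (the counters reset when the instance restarts, so check them after a representative period):

-- Compare read usage (seeks/scans/lookups) against maintenance cost (updates):
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id
 AND i.index_id  = s.index_id
WHERE s.database_id = DB_ID();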
This depends on your workload and your requirements. Sometimes data is loaded once and read millions of times, but sometimes not all loaded data is ever read.
Sometimes reads or writes must complete in certain time.
Case 1: If the table is static and is queried heavily (e.g. an item table in a shopping cart application), then indexes on the appropriate fields are highly beneficial.
Case 2: If the table is highly dynamic and not a lot of querying is done on a daily basis (e.g. log tables used for auditing purposes), then indexes will slow down the writes.
If the above two cases are the boundary cases, then whether or not to build indexes on a table depends on which of them the table in contention comes closest to.
If you still cannot decide, leave it to the judgement of the Database Engine Tuning Advisor. Good luck.
All,
Looking for some guidance on an Oracle design decision I am currently trying to evaluate:
The problem
I have data in three separate schemas on the same Oracle DB server. I am looking to build an application that will show data from all three schemas; however, the data that is shown will be based on real-time sorting and prioritisation rules that are applied to the data globally (i.e. based on the priority weightings applied, I may pull back data from any one of the three schemas).
Tentative Solution
Create a VIEW in the DB which maintains logical links to the relevant columns in the three schemas, and write a stored procedure which accepts parameterised priority weightings. The application subsequently calls the stored procedure to select the ‘prioritised’ row from the view and then queries the associated schema directly for additional data based on the row returned.
I have concerns over performance, since the data is being sorted/prioritised on every query, but I cannot see a way around this as the prioritisation rules will change often. We are talking about data sets in the region of 2-3 million rows per schema.
Does anyone have alternative suggestions on how to provide an aggregated and sorted view over the data?
Querying from multiple schemas (or even multiple databases) is not really a big deal, even inside the same query. Just prepend the table name with the schema you are interested in, as in
SELECT SOMETHING
FROM
SCHEMA1.SOME_TABLE ST1, SCHEMA2.SOME_TABLE ST2
WHERE ST1.PK_FIELD = ST2.PK_FIELD
If performance becomes a problem, then that is a big topic... optimal query plans, indexes, and your method of database connection can all come into play. One thing that comes to mind is that if it does not have to be realtime, then you could use materialized views (aka "snapshots") to cache the data in a single place. Then you could query that with reasonable performance.
Just set the snapshots to refresh at an interval appropriate to your needs.
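A hedged Oracle sketch of such a snapshot (only two of the three schemas are shown; object names and the hourly interval are purely illustrative):

CREATE MATERIALIZED VIEW combined_priority_mv
BUILD IMMEDIATE
REFRESH COMPLETE
START WITH SYSDATE
NEXT SYSDATE + 1/24          -- refresh hourly; pick whatever staleness the business tolerates
AS
SELECT st1.pk_field, st1.priority_input, st2.detail_col
FROM   schema1.some_table st1
JOIN   schema2.some_table st2
  ON   st2.pk_field = st1.pk_field;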
It doesn't matter that the data is from 3 schemas, really. What's important to know is how frequently the data will change, how often the criteria will change, and how frequently it will be queried.
If there is a finite set of criteria (that is, the data will be viewed in a limited number of ways) which only change every few days and it will be queried like crazy, you should probably look at materialized views.
If the criteria is nearly infinite, then there's no point making materialized views since they won't likely be reused. The same holds true if the criteria itself changes extremely frequently, the data in a materialized view wouldn't help in this case either.
The other question that's unanswered is how often the source data is updated, and how important it is to have the newest information. Frequently updated source data can either mean a materialized view will get "stale" for some duration, or that you may spend a lot of time refreshing the materialized views unnecessarily to keep the data "fresh".
Honestly, 2-3 million records isn't a lot for Oracle anymore, given sufficient hardware. I would probably benchmark simple dynamic queries first before attempting a fancy (materialized) view.
As others have said, querying a couple of million rows in Oracle is not really a problem, but then that depends on how often you are doing it - every tenth of a second may cause some load on the db server!
Without more details of your business requirements and a good model of your data, it's always difficult to provide good performance ideas. It usually comes down to coming up with a theory, then trying it against your database and assessing whether it is "fast enough".
It may also be worth taking a step back and asking yourself how accurate the results need to be. Does the business really need exact values for this query, or are good estimates acceptable?
Tom Kyte (of Ask Tom fame) always has some interesting ideas (and actual facts) in these areas. This article describes generating a proper dynamic search query - but Tom points out that when you query Google, it never tries to get the exact number of hits for a query - it gives you a guess. If you can accept a good estimate, then you can really improve query performance times.
I have an app, which cycles through a huge number of records in a database table and performs a number of SQL and .Net operations on records within that database (currently I am using Castle.ActiveRecord on PostgreSQL).
I added some basic btree indexes on a couple of the fields and, as you would expect, the performance of the SQL operations increased substantially. Wanting to make the most of DBMS performance, I want to make some better-educated choices about what I should index on all my projects.
I understand that there is a detriment to performance when doing inserts (as the database needs to update the index as well as the data), but what suggestions and best practices should I consider when creating database indexes? How do I best select the fields/combinations of fields for a set of database indexes (rules of thumb)?
Also, how do I best select which index to use as a clustered index? And when it comes to the access method, under what conditions should I use a btree over a hash, a GiST or a GIN index (what are they anyway?).
Some of my rules of thumb:
Index ALL primary keys (I think most RDBMS do this when the table is created).
Index ALL foreign key columns.
Create more indexes ONLY if:
Queries are slow.
You know the data volume is going to increase significantly.
Run statistics when populating a lot of data in tables.
If a query is slow, look at the execution plan and:
If the query for a table only uses a few columns, put all those columns into an index (a covering index); that way you can help the RDBMS answer the query from the index alone (see the sketch after this list).
Don't waste resources indexing tiny tables (hundreds of records).
Index multiple columns in order from high cardinality to less. This means: first index the columns with more distinct values, followed by columns with fewer distinct values.
If a query needs to access more than 10% of the data, a full scan is normally better than an index.
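Two hedged PostgreSQL examples of the covering-index and column-ordering points above (table and column names are made up; INCLUDE requires PostgreSQL 11+):

-- Composite index ordered from higher to lower cardinality:
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);

-- Covering index: queries touching only these columns can be answered
-- from the index alone (an index-only scan):
CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date)
    INCLUDE (total_amount);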
Here's a slightly simplistic overview: it's certainly true that there is an overhead to data modifications due to the presence of indexes, but you ought to consider the relative number of reads and writes to the data. In general the number of reads is far higher than the number of writes, and you should take that into account when defining an indexing strategy.
When it comes to which columns to index, I've always felt that the designer ought to know the business well enough to be able to take a very good first pass at which columns are likely to benefit. Other than that, it really comes down to feedback from the programmers, full-scale testing, and system monitoring (preferably with extensive internal metrics on performance to capture long-running operations).
As @David Aldridge mentioned, the majority of databases perform many more reads than they do writes, and in addition, appropriate indexes will often be utilised even when performing INSERTs (to determine the correct place to INSERT).
The critical indexes under an unknown production workload are often hard to guess/estimate, and a set of indexes should not be viewed as set once and forget. Indexes should be monitored and altered with changing workloads (that new killer report, for instance).
Nothing beats profiling; if you guess your indexes, you will often miss the really important ones.
As a general rule, if I have little idea how the database will be queried, then I will create indexes on all foreign keys, profile under a workload (think UAT release), and remove those that are not being used, as well as creating any important missing indexes.
Also, make sure a scheduled index maintenance plan is created.