Is there any penalty on creating a view from another view? - snowflake-cloud-data-platform

I have tables that are historicized and then views are created from them to retain only the most recent and active data.
I wanted to make views that would aggregate some of these views together, where I would create my view as a SELECT * FROM {Other view(s)}. So a bit like this:
Table -> Intermediate View -> Aggregated View
I'm just wondering if I'll run into any performance hit by basing my view on other views. Should I just instead have my aggregated views be more complex code-wise, but based directly on the underlying tables?
Table -> Aggregated View
Or does it not make a difference at all?
Thanks a lot.

From a performance viewpoint, it doesn't make any difference - unless you are building your view on a single table, in which case you would be able to materialize it. In fact, one of the biggest limitations of materialized views in Snowflake is that the FROM clause has to refer to a single table.
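As a minimal sketch of the two patterns (all table, column, and view names below are hypothetical):

-- Regular views can stack on one another; the optimizer simply expands both definitions at query time.
CREATE VIEW orders_current AS
SELECT * FROM orders_hist WHERE is_active = TRUE;

CREATE VIEW orders_summary AS
SELECT customer_id, SUM(amount) AS total_amount
FROM orders_current
GROUP BY customer_id;

-- A materialized view, by contrast, must select from exactly one table (no views, no joins):
CREATE MATERIALIZED VIEW orders_summary_mv AS
SELECT customer_id, SUM(amount) AS total_amount
FROM orders_hist
WHERE is_active = TRUE
GROUP BY customer_id;

Materialized views carry other restrictions as well (limited aggregate support, edition requirements), so check the current documentation before relying on one.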
From a software engineering viewpoint, I see many advantages: more reusable work, and more flexible and, potentially, faster development (while developer-A works on View-A, developer-B works on View-B, and developer-C could even work on View-C, which combines View-A and View-B).
The downside is the increased complexity of the view lineage, which might require a graphical representation in cases where there are too many objects to follow by hand.

I have found myself doing this more and more in Snowflake, to the point where I'm writing a blog about it and giving it a new acronym, ELVT. I've built a three-layer stacking of VIEWs at one client. The lowest level is a simple view against a single table, with presentation names for each column. The next layer is business logic over the underlying single-table VIEW. The third level joins VIEWs and applies more complex business logic (lots of UDFs).
I have a meta-data repository from which I generate all of the VIEWs (and which also provides lineage).
The final VIEWs have 35+ joins against 40+ physical tables (Salesforce, Marketo, Eloqua and others).
A SELECT * against multiple years of data on a medium warehouse averaged 1 min 25 s.
These VIEWs replaced thousands of lines of QLIK scripting with a simple SELECT FROM VIEW.
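As a rough illustration of that three-layer stacking (all schema, table, and column names below are hypothetical):

-- Layer 1: presentation names over a single physical table.
CREATE VIEW presentation.account AS
SELECT id AS account_id,
       acct_nm AS account_name,
       created_dt AS created_date
FROM raw.sfdc_account;

-- Layer 2: business logic over the single-table view.
CREATE VIEW business.active_account AS
SELECT account_id, account_name, created_date
FROM presentation.account
WHERE created_date >= DATEADD(year, -3, CURRENT_DATE());

-- Layer 3: joins across the layer-2 views plus the more complex logic (UDFs would go here).
CREATE VIEW reporting.account_activity AS
SELECT a.account_id, a.account_name, COUNT(t.touch_id) AS touches
FROM business.active_account a
LEFT JOIN business.marketing_touch t ON t.account_id = a.account_id
GROUP BY a.account_id, a.account_name;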

One point to note: this assumes you are comparing one really large block of SQL against nested views (aka macros). In that case they will perform the same.
The downside to nested views is that you are selecting a lot of columns in the SQL that gets compiled, so if the top level does not use most of the columns, your SQL compile times will be marginally slower.
Also, if you put a filter such as a date range over a large volume of SQL, the optimizer can sometimes fail to push the filter down, and you can then pull/compute large amounts of data that are later thrown away.
We found this happened, and the optimizer behaviour can change with releases - sometimes for the better, sometimes for much worse.
We ended up using table functions for a number of parts of the SQL to force the date range into the lower-layer "views". But we also controlled the layer writing the dynamic SQL, so this was an easy substitution.
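A sketch of that table-function substitution in Snowflake, using a hypothetical ORDERS table and columns, might look like this:

-- A SQL table function that forces the date filter into the lower layer:
CREATE OR REPLACE FUNCTION orders_in_range(start_date DATE, end_date DATE)
RETURNS TABLE (order_id NUMBER, customer_id NUMBER, amount NUMBER, order_date DATE)
AS
$$
    SELECT order_id, customer_id, amount, order_date
    FROM orders
    WHERE order_date BETWEEN start_date AND end_date
$$;

-- Callers supply the range, so the filter applies at the lowest layer
-- regardless of what the optimizer decides to push down:
SELECT customer_id, SUM(amount)
FROM TABLE(orders_in_range('2023-01-01'::DATE, '2023-03-31'::DATE))
GROUP BY customer_id;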

It depends on what type of processing you are doing in the view. If it is heavy, you can create a materialized view (this requires storage, and hence will incur some cost).
As a first option, try a regular view; if that does not help, then try a materialized view.

Related

Database Structure for hierarchical data with horizontal slices

We're currently looking at improving query performance for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (ie. we have full control over data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this datasource a lot, as in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be for the current data structure.
We're currently using MySQL queries behind memcached for this, and just doing additional queries to get children/parents. I'm thinking that some sort of tree-based or document-based database might be more suitable.
My question is: what's the best way to model this data for efficient read performance?
Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
Combining relational and OLAP:
http://en.wikipedia.org/wiki/HOLAP
Tableau:
http://www.tableausoftware.com/
*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystem's Cache is one Object Database I know of that sounds like a more appropriate fit based on what you've said.
http://www.intersystems.com/cache/
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, I know on MS SQL that a switch we did from having some core queries use a VARCHAR field to using an INTEGER field made a huge difference in speed. Text data is one of THE MOST expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
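A minimal sketch of that kind of normalization, in SQL Server syntax since that's what the answer mentions (all names are hypothetical; MySQL would use AUTO_INCREMENT instead of IDENTITY):

-- Replace a repeated VARCHAR with an integer key into a lookup table.
CREATE TABLE LastName (
    LastNameID INT IDENTITY(1,1) PRIMARY KEY,
    LastName   VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE Person (
    PersonID   INT IDENTITY(1,1) PRIMARY KEY,
    FirstName  VARCHAR(100) NOT NULL,
    LastNameID INT NOT NULL REFERENCES LastName (LastNameID)
);

-- Joins now compare small integers instead of variable-length text:
SELECT p.PersonID, p.FirstName, ln.LastName
FROM Person AS p
JOIN LastName AS ln ON ln.LastNameID = p.LastNameID;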
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.
A document/tree-based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up and down doesn't count: it is a simple join. Please keep in mind that by going the "document/tree-based database" route you would compromise your general querying ability. To summarize, just hire a competent DB specialist to analyze your performance bottlenecks - they are usually cured with mundane index additions.
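To make that distinction concrete, here is a sketch assuming a hypothetical adjacency-list table named item (columns id, parent_id, name):

-- One level up is a plain self-join:
SELECT child.id, child.name, parent.id AS parent_id, parent.name AS parent_name
FROM item AS child
JOIN item AS parent ON parent.id = child.parent_id
WHERE child.id = 42;

-- A genuinely hierarchical query walks an unknown number of levels,
-- e.g. a row and all of its ancestors (recursive CTE, available in MySQL 8+ / PostgreSQL):
WITH RECURSIVE ancestors AS (
    SELECT id, parent_id, name FROM item WHERE id = 42
    UNION ALL
    SELECT i.id, i.parent_id, i.name
    FROM item AS i
    JOIN ancestors AS a ON i.id = a.parent_id
)
SELECT * FROM ancestors;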
There's not really enough info here to say much useful - you'd need to measure things, look at EXPLAIN output, etc. - but one option that goes beyond the usual indexing would be to shard by level-3 instances. That would give you better performance on parallel queries that hit different shards, at its simplest (separate disks), or you could use separate machines if you want to throw more resources at each shard.
The only reason I mention this is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted - I have no idea what tools MySQL has for this).
And if your data volume isn't so high, then with sharding you might be able to get it down onto SSDs...

Using a denormalized database table for analytic data

In an online ticketing system I've built, I need to add real-time analytical reporting on orders for my client.
Important order data is split over multiple tables (customers, orders, line_items, package_types, tickets). Each table contains additional data that is unimportant to any report my client may need.
I'm considering recording each order as a separate line item in a denormalized report table. I'm trying to figure out if this makes sense or not.
Generally, the queries I'm running for the report only have to join across two or three of the tables at a time. Each table has the appropriate indices added.
Does it make sense to compile all of the order data into one table that contains only the necessary columns for the reporting?
The application is built on Ruby on Rails 3 and the DB is Postgresql.
EDIT: The goal of this would be to render the data in the browser as fast as possible for the user.
It depends on what your goal is. If you want to make the report outputs faster to display, then that would certainly work. The trade-off is that the denormalized data then has to be maintained, typically through batch updates. You could write a trigger that updates the table any time a new record comes in to the base tables, but that could potentially add a lot of overhead.
Maybe a view instead of a new table is a better solution in this case?
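A rough sketch of both options in PostgreSQL (table and column names are guesses based on the question):

-- A reporting view over just the columns the report needs:
CREATE VIEW order_report AS
SELECT o.id AS order_id,
       o.created_at,
       c.name AS customer_name,
       li.quantity,
       li.unit_price,
       t.code AS ticket_code
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN line_items li ON li.order_id = o.id
JOIN tickets t ON t.line_item_id = li.id;

-- If rendering is still too slow, a denormalized copy can be refreshed on a schedule
-- instead of maintained row-by-row with triggers:
CREATE TABLE order_report_cache AS SELECT * FROM order_report;
-- ...later, from a scheduled job:
TRUNCATE order_report_cache;
INSERT INTO order_report_cache SELECT * FROM order_report;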

What are the downsides of using SqlServer Views?

What are the downsides of using SqlServer Views?
I create views frequently to show my data in a denormalized form.
I find it much easier and therefore faster, less error prone, and more self documenting, to query one of these joins rather than to generate complex queries with complicated joins between many tables. Especially when I am analyzing the same data (many same fields, same table joins) from different angles.
But is there a cost to creating and using these views?
Am I slowing down (or speeding up?) query processing?
When comes to Views there are advantages and disadvantages.
Advantages:
They are virtual tables and not stored in the database as a distinct object. All that is stored is the SELECT statement.
It can be used as a security measure by restricting what the user can see (see the sketch after this answer).
It can make commonly used complex queries easier to read by encapsulating them into a view. This is a double-edged sword though - see disadvantage #3.
Disadvantages:
It does not have an optimized execution plan cached so it will not be as fast as a stored procedure.
Since it is basically just an abstraction of a SELECT it is marginally slower than doing a pure SELECT.
It can hide complexity and lead to gotchas. (Gotcha: ORDER BY not honored).
My personal opinion is to not use Views but to instead use stored procedures as they provide the security and encapsulation of Views but also come with improved performance.
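To make the security point above concrete, a minimal sketch (SQL Server syntax; all object and role names are hypothetical):

-- Expose only non-sensitive columns and only active rows to a reporting role.
CREATE VIEW dbo.EmployeeDirectory
AS
SELECT EmployeeID, FirstName, LastName, Department
FROM dbo.Employee
WHERE IsActive = 1;       -- salary, SSN, etc. never appear in the view
GO

GRANT SELECT ON dbo.EmployeeDirectory TO ReportingRole;

Provided the view and the table share an owner, ownership chaining means granting SELECT on the view is enough; the role never needs rights on the base table.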
One possible downside of using views is that you abstract the complexity of the underlying design which can lead to abuse by junior developers and report creators.
For a particularly large and complex project I designed a set of views which were to be used mostly by report designers to populate crystal reports. I found out weeks later that junior devs had started using these views to fetch aggregates and join these already large views simply because they were there and were easy to consume. (There was a strong element of EAV design in the database.) I found out about this after junior devs started asking why seemingly simple reports were taking many minutes to execute.
The efficiency of a view depends in large part on the underlying tables. The view really is just an organized and consistent way to look at query results. If the query used to form the view is good, and uses proper indexes on the underlying tables, then the view shouldn't negatively impact performance.
In SQL Server you can also create materialized or indexed views (since SQL Server 2000), which increase speed somewhat.
I use views regularly as well. One thing to note, however, is that using lots of views could be hard to maintain if your underlying tables change frequently (especially during development).
EDIT: Having said that, I find the convenience and advantage of being able to simplify and re-use complex queries outweighs the maintenance issue, especially if the views are used responsibly.
Views can be a detriment to performance when the view contains logic, columns, rows, or tables that aren't ultimately used by your final query. I can't tell you how many times I've seen stuff like:
SELECT ...
FROM (View with complex UNION of ActiveCustomer and InactiveCustomer tables)
WHERE Active = True
(thus filtering out all rows that were included in the view from the InactiveCustomer table), or
SELECT (one column)
FROM (view that returns 50 columns)
(SQL has to retrieve lots of data that is then discarded at a later step. It's possible those other columns are expensive to retrieve, for example through a bookmark lookup), or
SELECT ...
FROM (view with complex filters)
WHERE (entirely different filters)
(it's likely that SQL could have used a more appropriate index if the tables were queried directly),
or
SELECT (only fields from a single table)
FROM (view that contains crazy complex joins)
(lots of CPU overhead through the join, and unnecessary IO for the table reads that are later discarded), or my favorite:
SELECT ...
FROM (Crazy UNION of 12 tables each containing a month of data)
WHERE OrderDate = #OrderDate
(Reads 12 tables when it only really needs to read 1).
In most cases, SQL is smart enough to "see through the covers" and come up with an effective query plan anyway. But in other cases (especially very complex ones), it can't. In each of the above situations, the answer was to remove the view and query the underlying tables instead.
At the very least (even if you think SQL would be smart enough to optimize it anyway), eliminating the view can sometimes make your own query debugging and optimization easier (a bit more obvious what needs to be done).
A downside to views that I've run into is a dive in performance when incorporating them into distributed queries. This SQLMag article discusses the issue, and whilst I use highly artificial data in the demo, I've run into this problem time and time again in the "real world".
Respect your views, and they'll treat you well.
What are Various Limitations of the Views in SQL Server?
Top 11 Limitations of Views
Views do not support COUNT(); however, they can support COUNT_BIG()
ORDER BY clause does not work in View
Regular queries or Stored Procedures give us flexibility when we need another column; we can add a column to regular queries right away. If we want to do the same with Views, then we will have to modify them first
Index created on view not used often
Once the view is created, if the base table has any column added or removed, it is not usually reflected in the view till it is refreshed
UNION Operation is not allowed in Indexed View
We cannot create an index in a nested-view situation; that is, we cannot create an index on a view which is built from another view.
SELF JOIN Not Allowed in Indexed View
Outer Join Not Allowed in Indexed Views
Cross Database Queries Not Allowed in Indexed View
Source: SQL MVP Pinal Dave
http://blog.sqlauthority.com/2010/10/03/sql-server-the-limitations-of-the-views-eleven-and-more/
When I started I always thought views added performance overhead; however, experience paints a different story (the view mechanism itself has negligible overhead).
It all depends on what the underlying query is. Check out indexed views here or here; ultimately you should test the performance both ways to obtain a clear performance profile.
My biggest 'gripe' is that ORDER BY does not work in a view. While it makes sense, it is a case which can jump up and bite you if not expected. Because of this I have had to switch away from using views to sprocs (which have more than enough problems of their own) in a few cases where I could not specify an ORDER BY later. (I wish there were a "FINAL VIEW" construct - one that could, for example, include ORDER BY semantics.)
http://blog.sqlauthority.com/2010/10/03/sql-server-the-limitations-of-the-views-eleven-and-more/ (Limitation #1 is about ORDER BY :-)
The following is a SQL hack that allows an order by to be referenced in a view:
create view toto1 as
select top 99.9999 percent F1
from Db1.dbo.T1 as a
order by 1
But my preference is to use Row_Number:
create view toto2 as
select *, ROW_NUMBER() over (order by [F1]) as RowN
from (
    select f1
    from Db1.dbo.T1
) as a

SQL Server Indexed View Question

I have a requirement to create a report that is killing the processor and taking a long time to run.
I think I could speed this up significantly by creating an indexed view that keeps all this data in one place, making it a lot easier to query/report on. This view would not just be used for the report, as I think it would benefit quite a few areas in the data layer.
The indexed view will potentially contain 5 million+ records. I can't seem to find any guidance as to at what point indexed views are no longer recommended. I assume that an indexed view of this size would take considerable time to build when SQL first starts, but I would hope that after this the cost of maintaining it would be minimal.
Is there any kind of best practice guide as to when to use index views and when not to use them? Would the view rebuild itself after every server restart or does it get stored somewhere on the disk?
The index associated with your indexed view will be updated whenever updates are made to any of the columns in the index.
High numbers of updates will most likely kill the benefit. If it is mainly reads then it will work fine.
The real benefits of Indexed Views are when you have aggregates that are too expensive to compute in real time.
Please see: Improving Performance with SQL Server 2008 Indexed Views:
Indexed views can increase query performance in the following ways:
Aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.
Tables can be prejoined and the resulting data set stored.
Combinations of joins or aggregations can be stored.
The query optimizer considers indexed views only for queries with nontrivial cost. This avoids situations where trying to match various indexed views during query optimization costs more than the savings achieved by the indexed view usage. Indexed views are rarely used in queries with a cost of less than 1.
Applications that benefit from the implementation of indexed views include:
Decision support workloads.
Data marts.
Data warehouses.
Online analytical processing (OLAP) stores and sources.
Data mining workloads.
From the query type and pattern point of view, the benefiting applications can be characterized as those containing:
Joins and aggregations of large tables.
Repeated patterns of queries.
Repeated aggregations on the same or overlapping sets of columns.
Repeated joins of the same tables on the same keys.
Combinations of the above.
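As a sketch of what such a pre-aggregating indexed view looks like (table and column names are hypothetical; Amount is assumed NOT NULL, since SUM over a nullable column is not allowed here):

CREATE VIEW dbo.OrderTotalsByDay
WITH SCHEMABINDING
AS
SELECT OrderDate,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS RowCnt        -- COUNT_BIG(*) is required when GROUP BY is used
FROM dbo.Orders
GROUP BY OrderDate;
GO

-- The unique clustered index is what actually materializes the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_OrderTotalsByDay ON dbo.OrderTotalsByDay (OrderDate);

Note that on some editions the optimizer will not match the view automatically; queries may need to reference it directly WITH (NOEXPAND) to use the index.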
An indexed view (aka materialized view) is maintained by SQL Server after every change to the underlying table(s). Needless to say, you should not have an indexed view on a table with heavy update traffic.
For your problem, a better solution would be to run the query and store it in its own table, like:
select * into CachedReport from YourView
That will give you the performance of an indexed view, while you can decide when to refresh it. For example, you could refresh it by running the select into query from a scheduled job every night.
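A sketch of that nightly refresh (object names are hypothetical apart from those in the answer), run from a SQL Agent job or other scheduler:

IF OBJECT_ID('dbo.CachedReport') IS NOT NULL
    DROP TABLE dbo.CachedReport;

SELECT *
INTO dbo.CachedReport
FROM dbo.YourView;

-- Optionally index the cached copy for the report's access pattern:
CREATE CLUSTERED INDEX IX_CachedReport_OrderDate ON dbo.CachedReport (OrderDate);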
I'm not aware of any guidance concerning size of indexed views. It's effectively a separate table that's being "automagically" updated every time the base tables on which it depends are updated, so I tend to think of it as a separate table.
As to your question on the building of the index - it's stored on disk, the same as every other index, so it doesn't get rebuilt during server restart (other than any repair that takes place due to transactions not having completed before the restart).
There's no hard row number limit on when to use a table or a materialised view.
However, as a guideline, avoid a materialised view over volatile tables - the maintenance load can kill your server.
First off, as Timothy suggested, check the indexes on your underlying tables, then the statistics. Your query optimiser might just be completely off track due to missing or out-of-date statistics.
If this doesn't help with performance, check what data is really required from the view, as my guess is that (a) the row count and (b) the row size are what is killing your server: loading the whole view into a temp table and running into I/O contention.

Oracle Multiple Schemas Aggregate Real Time View

All,
Looking for some guidance on an Oracle design decision I am currently trying to evaluate:
The problem
I have data in three separate schemas on the same Oracle DB server. I am looking to build an application that will show data from all three schemas; however, the data that is shown will be based on real-time sorting and prioritisation rules applied to the data globally (i.e. based on the priority weightings applied, I may pull back data from any one of the three schemas).
Tentative Solution
Create a VIEW in the DB which maintains logical links to the relevant columns in the three schemas, and write a stored procedure which accepts parameterised priority weightings. The application subsequently calls the stored procedure to select the ‘prioritised’ row from the view and then queries the associated schema directly for additional data based on the row returned.
I have concerns over performance, because the data is being sorted/prioritised on each query, but I cannot see a way around this as the prioritisation rules will change often. We are talking of data sets in the region of 2-3 million rows per schema.
Does anyone have alternative suggestions on how to provide an aggregated and sorted view over the data?
Querying from multiple schemas (or even multiple databases) is not really a big deal, even inside the same query. Just prepend the table name with the schema you are interested in, as in
SELECT SOMETHING
FROM
SCHEMA1.SOME_TABLE ST1, SCHEMA2.SOME_TABLE ST2
WHERE ST1.PK_FIELD = ST2.PK_FIELD
If performance becomes a problem, then that is a big topic... optimal query plans, indexes, and your method of database connection can all come into play. One thing that comes to mind is that if it does not have to be realtime, then you could use materialized views (aka "snapshots") to cache the data in a single place. Then you could query that with reasonable performance.
Just set the snapshots to refresh at an interval appropriate to your needs.
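A sketch of such a periodically refreshed snapshot over the three schemas (schema, table, and column names are hypothetical):

CREATE MATERIALIZED VIEW combined_data_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE
    START WITH SYSDATE
    NEXT SYSDATE + 1/24            -- re-run the refresh roughly every hour
AS
SELECT 'S1' AS source_schema, pk_field, priority_field, payload FROM schema1.some_table
UNION ALL
SELECT 'S2', pk_field, priority_field, payload FROM schema2.some_table
UNION ALL
SELECT 'S3', pk_field, priority_field, payload FROM schema3.some_table;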
It doesn't matter that the data is from 3 schemas, really. What's important to know is how frequently the data will change, how often the criteria will change, and how frequently it will be queried.
If there is a finite set of criteria (that is, the data will be viewed in a limited number of ways) which only change every few days and it will be queried like crazy, you should probably look at materialized views.
If the criteria are nearly infinite, then there's no point making materialized views since they won't likely be reused. The same holds true if the criteria themselves change extremely frequently; the data in a materialized view wouldn't help in that case either.
The other question that's unanswered is how often the source data is updated, and how important it is to have the newest information. Frequently updated source data can mean either that a materialized view will get "stale" for some duration, or that you spend a lot of time refreshing the materialized views unnecessarily to keep the data "fresh".
Honestly, 2-3 million records isn't a lot for Oracle anymore, given sufficient hardware. I would probably benchmark simple dynamic queries first before attempting fancy (materialized) views.
As others have said, querying a couple of million rows in Oracle is not really a problem, but then that depends on how often you are doing it - every tenth of a second may cause some load on the db server!
Without more details of your business requirements and a good model of your data it's always difficult to provide good performance ideas. It usually comes down to coming up with a theory, then trying it against your database and assessing if it is "fast enough".
It may also be worth taking a step back and asking yourself how accurate the results need to be. Does the business really need exact values for this query, or are good estimates acceptable?
Tom Kyte (of Ask Tom fame) always has some interesting ideas (and actual facts) in these areas. This article describes generating a proper dynamic search query, but Tom points out that when you query Google it never tries to get the exact number of hits for a query - it gives you a guess. If you can apply a good estimate then you can really improve query performance times.
