How to test/validate views in Snowflake

Sometimes, upstream DDL changes can break downstream views (it shouldn't happen, but humans make mistakes).
In order to detect these defects before our stakeholders do, is there a way to automatically test the validity of all views in Snowflake?

A view is not automatically updated if the underlying source objects are modified.
Currently, Snowflake does not have a way of tracking which views are no longer valid, as the view definition is only validated at execution time.
Workaround
You may use the OBJECT_DEPENDENCIES view / the GET_OBJECT_REFERENCES function to identify each view's dependencies and try to rebuild the view definition (a sketch is shown below).
https://docs.snowflake.com/en/sql-reference/account-usage/object_dependencies.html
https://docs.snowflake.com/en/user-guide/object-dependencies.html#label-object-dependencies-features
Use the task feature to check the status of the source tables and send a notification.
You could create a stored procedure that uses SNOWFLAKE.ACCOUNT_USAGE.VIEWS and GET_OBJECT_REFERENCES() to find the invalid views.
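As a rough sketch of the object-dependencies idea, assuming the column names documented for the SNOWFLAKE.ACCOUNT_USAGE.OBJECT_DEPENDENCIES and TABLES views (and keeping in mind their latency of up to a few hours), something like this can flag views whose referenced tables have since been dropped:
-- Sketch only: list views that reference tables which no longer exist.
-- (A table that was dropped and re-created will match both its old and new
-- rows, so refine as needed before relying on this.)
SELECT dep.referencing_database,
       dep.referencing_schema,
       dep.referencing_object_name AS view_name,
       dep.referenced_object_name  AS dropped_source
FROM snowflake.account_usage.object_dependencies AS dep
JOIN snowflake.account_usage.tables AS t
  ON  t.table_catalog = dep.referenced_database
  AND t.table_schema  = dep.referenced_schema
  AND t.table_name    = dep.referenced_object_name
WHERE dep.referencing_object_domain = 'VIEW'
  AND dep.referenced_object_domain  = 'TABLE'
  AND t.deleted IS NOT NULL;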

We started calling this query on each view, and it will produce an error if the view is invalid (hence failing the test):
EXPLAIN SELECT 1 FROM database.schema.view LIMIT 1;
Even though the query without EXPLAIN is extremely simple, it can still be slow on more complex views.
EXPLAIN only builds the query plan rather than executing the query (and the query planning will fail if the view is invalid).
The query plan is built exclusively in the cloud services layer, so those queries do not require an active warehouse, and are essentially free (as long as your cloud services usage remains below 10% of your total usage).
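If you want to generate those probes outside of dbt, one lightweight option (a sketch, using the standard INFORMATION_SCHEMA.VIEWS columns) is to build the EXPLAIN statements from the catalog and run them with your scheduler of choice:
-- Sketch: emit one EXPLAIN probe per view in the current database.
SELECT 'EXPLAIN SELECT 1 FROM "' || table_catalog || '"."' || table_schema || '"."' || table_name || '" LIMIT 1;' AS test_statement
FROM information_schema.views
WHERE table_schema <> 'INFORMATION_SCHEMA';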
We use dbt to run this test every hour, but you could use any other tool that lets you automate queries (such as SnowSQL).
Some details on our usage of dbt:
We have added a project variable called test_views.
We have added a test_view() macro:
{%- macro test_view(model) -%}
EXPLAIN SELECT 1 FROM {{ model.database }}.{{ model.schema }}.{{ model.alias }} LIMIT 1;
{%- endmacro -%}
We have added the following code to the view materialization:
{%- set sql = test_view(model) if (model.config.materialized == 'view' and var('test_views') == True) else sql -%}
We can run the test with this command:
dbt run -s config.materialized:view --vars '{test_views: true}'
Another option would be to simply re-create the views constantly (which would fail if there is an issue with the upstream dependencies), but that just seems terribly wasteful.

Related

dbt incremental load merge operation without IS_INCREMENTAL() macro

I have set my model to be incremental by
{{
    config(
        materialized='incremental',
        unique_key='my_key',
        incremental_strategy='merge'
    )
}}
Do I absolutely have to have the is_incremental() macro at the end of the statement?
If not, then without is_incremental():
1- Is dbt running a full refresh in the pipeline?
2- Is dbt still merging the output on my_key in Snowflake? Or is dbt going to completely overwrite the entire Snowflake table with the full refresh output?
is_incremental() is just a convenience macro. It evaluates to True if the model is being run in "incremental mode" (i.e., the table already exists and the --full-refresh flag has not been passed).
If a model is running for the first time or --full-refresh is set, dbt will drop the target table and create a new table with the results of the select statement in your model.
Otherwise, in incremental mode, an incremental model will attempt to upsert any records returned by the select statement.
Typically, an incremental model's source continues to grow over time (maybe it's a table of 100M records growing by 2M records/week). In this case, you would use something like:
select *
from source
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
so that you're only selecting the new records from your source table.
However, if your source table already purges old records, you may not need the special where clause. Just note that this isn't the pattern dbt was designed for, and if someone ever runs --full-refresh on your prod target you'll lose all of the old data.
If you do have a source that purges (deletes/truncates) old records, I would consider creating a snapshot of that source, instead of loading the values into an incremental model directly. Snapshots are designed for this kind of use case, and will generalize better across multiple dev environments, etc.
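A minimal snapshot sketch, assuming a timestamp strategy and hypothetical source, key, and timestamp column names (e.g. snapshots/my_source_snapshot.sql):
{% snapshot my_source_snapshot %}
    {{
        config(
            target_schema='snapshots',
            unique_key='my_key',
            strategy='timestamp',
            updated_at='updated_at'
        )
    }}
    select * from {{ source('my_source', 'my_table') }}
{% endsnapshot %}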
Incremental docs
Snapshot docs

How to Get Query Plan During Execution

Is it possible to capture the query text of the entire RPC SELECT statement from within the view which it calls? I am developing an enterprise database in SQL Server 2014 which must support a commercial application that submits a complex SELECT statement in which the outer WHERE clause is critical. This statement also includes subqueries against the same object (in my case, a view) which do NOT include that condition. This creates a huge problem because the calling application is joining those subqueries to the filtered result on another field and thus producing ID duplication that throws errors elsewhere.
The calling application assumes it is querying a table (not a view) and it can only be configured to use a simple WHERE clause. My task is to implement a more sophisticated security model behind this rather naive application. I can't re-write the offending query but I had hoped to retrieve the missing information from the cached query plan. Here's a super-simplified pseudo-view of my proposed solution:
CREATE VIEW schema.important_data AS
WITH a AS (SELECT special_function() AS condition),
b AS (SELECT c.criteria
FROM a, lookup_table AS c
WHERE a.condition IS NULL OR c.criteria = a.condition)
SELECT d.field_1, d.field_2, d.filter_field
FROM b, underlying_table AS d
WHERE d.filter_field = b.criteria;
My "special function" reads the query plan for the RPC, extracts the WHERE condition and preemptively filters the view so it always returns only what it should. Unfortunately, query plans don't seem to be cached until after they are executed. If the RPC is made several times, my solution works perfectly -- but only on the second and subsequent calls.
I am currently using dm_exec_cached_plans, dm_exec_query_stats and dm_exec_sql_text to obtain the full text of the RPC by searching on spid, plan creation time and specific words. It would be so much better if I could somehow get the currently executing plan. I believe dm_exec_requests does that but it only returns the current statement -- which is just the definition of my function.
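For reference, the kind of lookup described is roughly this (a sketch only; as noted above, from inside the function it may surface only the function's own text rather than the outer RPC, and it requires VIEW SERVER STATE):
-- Sketch: full batch text behind the current session's active request;
-- the statement offsets mark the statement currently executing within it.
SELECT st.text AS batch_text,
       r.statement_start_offset,
       r.statement_end_offset
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.session_id = @@SPID;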
Extended Events look promising but unfamiliar and there is a lot to digest. I haven't found any guidance, either, on whether they are appropriate for this particular challenge, whether a notification can be obtained in real time or how to ensure that the Event Session is always running. I am pursuing this investigation now and would appreciate any suggestions or advice. Is there anything else I can do?
This turns out to be an untenable idea and I have devised a less elegant work-around by restructuring the view itself. There is a performance penalty but the downstream error is avoided. My fundamental problem is the way the client application generates its SQL statements and there is nothing I can do about that -- so, users will just have to accept whatever limitations may result.

How can I refer to a database deployed from the same DACPAC file, but with a different db name?

Background
I have a multi-tenant scenario and a single SQL Server project that will be deployed into multiple database instances on the same server. There will be one db for each tenant, plus one "model" db.
The "model" database serves three purposes:
Force some "system" data to be always present in each tenant database
Serves as an access point for users with a special permission to edit system data (which is then synced out to all tenants as needed)
When creating a new tenant, the database will be copied and attached with a new name representing the tenant
There are triggers that check whether modified or deleted data within a tenant db corresponds to "system" data inside the "model" db. If it does, an error is raised saying that system data cannot be altered.
Issue
So here's a part of the trigger that checks if deletion can be allowed:
IF DB_NAME() <> 'ModelTenant' AND EXISTS
(
    SELECT [deleted].*
    FROM [deleted]
    INNER JOIN [---MODEL DB NAME??? ---].[MySchema].[MyTable] [ModelTable]
        ON [deleted].[Guid] = [ModelTable].[Guid]
)
BEGIN;
    THROW 50000, 'The DELETE operation on table MyTable cannot be performed. At least one targeted record is reserved by the system and cannot be removed.', 1
END
I can't seem to find what should take the place of --- MODEL DB NAME??? --- in the above code that would allow the project to compile properly. When referring to a completely different project I know what to do: add a reference to that project, represented by an SQLCMD variable. But in this scenario the reference is essentially to the same project, only on a different database, and I can't seem to add a self-reference in this manner.
What can I do? Does SSDT offer some kind of support for such a scenario?
Have you tried setting up a Database Variable? You can read under "Reference aware statements" here. You could then say:
SELECT * FROM [$(MyModelDb)].[MySchema].[MyTable] [ModelTable]
If you don't have a specific project for $(MyModelDb) you can choose the option to "suppress errors by unresolved references...". It's been forever since I've used SSDT projects, but I think that should work.
TIP: If you need to reference one table 100 times, you may find it better to create a SYNONYM that uses the database variable, then point to the SYNONYM in your SPROCs/TRIGGERs. Why? Because that way you don't need to deploy your SPROCs/TRIGGERs to get the variable replaced with the actual value, and that can make development smoother.
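A sketch of that synonym approach, with illustrative object names; the SQLCMD variable is resolved once at publish time inside the synonym, and triggers/procs then reference the synonym instead:
CREATE SYNONYM [MySchema].[Model_MyTable]
    FOR [$(MyModelDb)].[MySchema].[MyTable];
-- The trigger join then becomes:
--   INNER JOIN [MySchema].[Model_MyTable] [ModelTable]
--       ON [deleted].[Guid] = [ModelTable].[Guid]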
I'm not quite sure if SSDT is particularly well-suited to projects of any decent amount of complexity. I can think of one or two ways to most likely accomplish this (especially depending on exactly how you do the publishing / deployment), but I think you would actually lose more than you gain. What I mean by that is: you could add steps to get this to work (i.e. win the battle), but you would be creating a more complex system in order to get SSDT to publish a system that is more complex (and slower) than it needs to be (i.e. lose the war).
Before worrying about SSDT, let's look at why you need/want SSDT to do this in the first place. You have system data intermixed with tenant data, and you need to validate UPDATE and DELETE operations to ensure that the system data does not get modified, and the only way to identify data that is "system" data is by matching it to a home-of-record -- ModelDB -- based on GUID PKs.
That theory on identifying what data belongs to the "system" and not to a tenant is your main problem, not SSDT. You are definitely on the right track for a multi-tenant system by having the "model" database, but using it for data validation is a poor design choice: on top of the performance degradation already incurred from using GUIDs as PKs, you are further slowing down all of these UPDATE and DELETE operations by funneling them through a single point of contention, since all client DBs need to check this common source.
You would be far better off to include a BIT field in each of these tables that mixes system and tenant data, denoting whether the row was "system" or not. Just look at the system catalog views within SQL Server:
sys.objects has an is_ms_shipped column
sys.assemblies went the other direction and has an is_user_defined column.
So, if you were to add an [IsSystemData] BIT NOT NULL column to these tables, your Trigger logic would become:
IF DB_NAME() <> N'ModelTenant' AND EXISTS
(
    SELECT del.*
    FROM [deleted] del
    WHERE del.[IsSystemData] = 1
)
BEGIN
    ;THROW 50000, 'The DELETE operation on table MyTable cannot be performed. At least one targeted record is reserved by the system and cannot be removed.', 1;
END;
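Adding the flag column itself could look something like this (constraint name and default are illustrative):
ALTER TABLE [MySchema].[MyTable]
    ADD [IsSystemData] BIT NOT NULL
        CONSTRAINT [DF_MyTable_IsSystemData] DEFAULT (0);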
Benefits:
No more SSDT issue (at least not from this part ;-)
Faster UPDATE and DELETE operations
Less contention on the shared resource (i.e. ModelDB)
Less code complexity
As an alternative to referencing another database project, you can produce a dacpac, then reference the dacpac as a database reference in "same server, different database" mode.

Inline SQL versus stored procedure

I have a simple SELECT statement with a couple of columns referenced in the WHERE clause. Normally I do these simple ones in the VB code (set up a Command object, set its CommandType to Text and its CommandText to the SELECT statement). However I'm seeing timeout problems. We've optimized just about everything we can with our tables, etc.
I'm wondering if there'd be a big performance hit just because I'm doing the query this way, versus creating a simple stored procedure with a couple params. I'm thinking maybe the inline code forces SQL to do extra work compiling, creating query plan, etc. which wouldn't occur if I used a stored procedure.
An example of the actual SQL being run:
SELECT TOP 1 * FROM MyTable WHERE Field1 = @Field1 ORDER BY ID DESC
A well formed "inline" or "ad-hoc" SQL query - if properly used with parameters - is just as good as a stored procedure.
But this is absolutely crucial: you must use properly parametrized queries! If you don't - if you concatenate together your SQL for each request - then you don't benefit from these points...
Just like with a stored procedure, upon first execution a query execution plan must be computed, and that execution plan is then cached in the plan cache.
That query plan is reused over and over again, if you call your inline parametrized SQL statement multiple times - and the "inline" SQL query plan is subject to the same cache eviction policies as the execution plan of a stored procedure.
Just from that point of view - if you really use properly parametrized queries - there's no performance benefit for a stored procedure.
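To make "properly parameterized" concrete: a parameterized ad-hoc call is roughly what client libraries such as ADO.NET's SqlClient send under the hood, and in T-SQL terms it looks like this (the parameter type is an assumption):
-- Parameterized ad-hoc batch: the plan is cached and reused across calls
-- with different values, just like a stored procedure's plan.
EXEC sys.sp_executesql
    N'SELECT TOP 1 * FROM MyTable WHERE Field1 = @Field1 ORDER BY ID DESC',
    N'@Field1 int',
    @Field1 = 42;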
Stored procedures have other benefits (like being a "security boundary" etc.), but just raw performance isn't one of their major plus points.
It is true that the db has to do the extra work you mention, but that should not result in a big performance hit (unless you are running the query very, very frequently..)
Use SQL Profiler to see what is actually getting sent to the server. Use Activity Monitor to see if there are other queries blocking yours.
Your query couldn't be simpler. Is Field1 indexed? As others have said, there is no performance hit associated with "ad-hoc" queries.
For where to put your queries, this is one of the oldest debates in tech. I would argue that your queries "belong" to your application: they will be versioned with your app, tested with your app, and should disappear when your app disappears. Putting them anywhere other than in your app is walking into a world of pain. But for goodness' sake, use .sql files, compiled as embedded resources.
Inline query:
A SELECT statement embedded in another statement (for example in its FROM clause) is called an inline query.
Cannot take parameters.
Not a database object.
Procedure:
Can take parameters.
Is a database object.
Can be reused globally whenever the same action needs to be performed.

Consolidating NHibernate open sessions to the DB (coupling NHibernate and DB sessions)

We use NHibernate for ORM, and at the initialization phase of our program we need to load many instances of some class T from the DB.
In our application, the following code, which extracts all these instances, takes forever:
public IList<T> GetAllWithoutTransaction()
{
using (ISession session = GetSession())
{
IList<T> entities = session
.CreateCriteria(typeof(T))
.List<T>();
return entities;
}
}
Using the NHibernate log I found that the actual SQL queries the framework uses are:
{
Load a bunch of rows from a few tables in the DB (one SELECT statement).
for each instance of class T
{
Load all the data for this instance of class T from the abovementioned rows
(3 SELECT statements).
}
}
The 3 select statements are coupled, i.e. the second is dependent on the first, and the third on the first two.
As you can see, the number of SELECT statements is in the millions, giving us a huge overhead which results directly from all those calls to the DB (each of which entails "open DB session", "close DB session", ...), even though we are using a single NHibernate session.
My question is this:
I would somehow like to consolidate all those SELECT statements into one big SELECT statement. We are not running multithreaded, and are not altering the DB in any way at the init phase.
One way of doing this would be to define my own object and map it using NHibernate, which would be loaded quickly and load everything in one query, but it would require that we implement the join operations used in those statements ourselves, and worse, it breaks the ORM abstraction.
Is there any way of doing this by some configuration?
Thanks guys!
This is known as the SELECT N+1 problem. You need to decide where you're going to place your joins (e.g. FetchMode.Eager).
If you can write the query as a single query in SQL, you can get NHibernate to execute it as a single query (usually without breaking the abstraction).
It appears likely that you have some relationships / classes setup to lazy load when what you really want in this scenario is eager loading.
There is plenty of good information about this in the NHibernate docs. You may want to start here:
http://www.nhforge.org/doc/nh/en/index.html#performance-fetching
