We're using DBT to run automated CI/CD to provision all our resources in Snowflake, including databases, schemas, users, roles, warehouses, etc.
The issue comes up when we're creating warehouses -- the active warehouse automatically switches over to the newly created one. And this happens whether or not the warehouse already exists (we're using CREATE WAREHOUSE IF NOT EXISTS commands).
This effectively resumes/turns on all our warehouses for no reason (even though we're using INITIALLY_SUSPENDED = TRUE), because Snowflake then uses each newly "created" warehouse to execute the subsequent queries. Our CI/CD then continues on the wrong warehouse (whichever one was created last). We have a dedicated warehouse for CI/CD, and we'd like execution to remain on that one (so we can monitor the costs).
We're aware that this is the default behavior specified in the documentation, but is there any way to create a warehouse without using it?
I wish the CREATE WAREHOUSE command had a parameter like USE_WAREHOUSE = TRUE|FALSE.
As a workaround, we're exploring ways to skip the CREATE WAREHOUSE commands entirely if the warehouse already exists, but that doesn't solve the issue for warehouses that do need to be created.
Otherwise, we might just add a USE WAREHOUSE command after every CREATE WAREHOUSE, in order to return to the original CI/CD warehouse.
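Roughly what we have in mind for the existence check (SOME_WH is just a placeholder for a warehouse being provisioned; SHOW itself doesn't need a running warehouse):
SHOW WAREHOUSES LIKE 'SOME_WH';
-- 0 rows back from the SHOW means the CREATE WAREHOUSE still needs to run
SELECT COUNT(*) AS wh_exists
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));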
The idea is to store the current warehouse in a variable and restore it afterwards:
SET warehouse_name = (SELECT CURRENT_WAREHOUSE());
CREATE WAREHOUSE TEST WAREHOUSE_SIZE=XSMALL, INITIALLY_SUSPENDED=TRUE;
USE WAREHOUSE IDENTIFIER($warehouse_name);
Alternatively, wrap it in a stored procedure (simplified version: no error handling, and only the warehouse name is passed as a parameter):
CREATE OR REPLACE PROCEDURE create_warehouse(CURRENT_WAREHOUSE_NAME STRING, WAREHOUSE_NAME STRING)
RETURNS VARCHAR
LANGUAGE javascript
AS
$$
    var rs = snowflake.execute({sqlText: `CREATE WAREHOUSE IF NOT EXISTS IDENTIFIER(?) WAREHOUSE_SIZE=MEDIUM, INITIALLY_SUSPENDED=TRUE`, binds: [WAREHOUSE_NAME]});
    // Restore the original warehouse. USE WAREHOUSE cannot be used inside a stored procedure,
    // but re-running CREATE WAREHOUSE IF NOT EXISTS on the existing warehouse switches back to it.
    var rs2 = snowflake.execute({sqlText: `CREATE WAREHOUSE IF NOT EXISTS IDENTIFIER(?)`, binds: [CURRENT_WAREHOUSE_NAME]});
    return 'Done.';
$$;
CALL create_warehouse(CURRENT_WAREHOUSE(), 'TEST');
The Snowflake docs describe an INITIALLY_SUSPENDED property (default FALSE), which specifies whether the warehouse is created initially in the 'Suspended' state.
I think you should set that property to TRUE in your script.
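For example, something like this (my_wh is a placeholder name):
CREATE WAREHOUSE IF NOT EXISTS my_wh
  WAREHOUSE_SIZE = XSMALL
  INITIALLY_SUSPENDED = TRUE;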
I have a SQL query that references existing schemas/tables in Snowflake, which I can run in worksheets to see my data.
The trouble is that I want to wrap this SQL query in a scheduled task so that it runs every 3 hours and writes the retrieved data to a table that may or may not already exist. If the table doesn't exist, I want the task to create it and populate the fields with the right field types.
My resultant data:
USER_ID|PATIENT|DATE|QUANTITY|QUANTITY_DURATION|EXPIRATION|MAX_DURATION|LAST_QUANTITY_DAY|DAYS_REMAINING|
1111|2222|2021-01-06 00:00:00.000|1|90|730|90|2021-04-06|-4
...
From https://docs.snowflake.com/en/sql-reference/sql/create-task.html, I know the syntax generally to be:
CREATE TASK mytask
WAREHOUSE = mywh
SCHEDULE = '180 MINUTE' -- every 3 hours; schedules are expressed in minutes (or CRON)
AS
INSERT INTO mytable(ts) VALUES(CURRENT_TIMESTAMP);
But I'm unsure how to make it more automated such that if someone accidentally deletes the table it will make it again on its own.
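A rough sketch of the kind of thing I'm after (refresh_mytable, my_source_view, and mytable are placeholder names; a task can only run a single statement, so the create-if-missing logic would live in a procedure the task calls):
CREATE OR REPLACE PROCEDURE refresh_mytable()
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS
$$
    // Recreate the target with the right column types if someone dropped it
    // (CTAS with WHERE FALSE copies the structure but no rows).
    snowflake.execute({sqlText: `CREATE TABLE IF NOT EXISTS mytable AS SELECT * FROM my_source_view WHERE FALSE`});
    // Append the latest query results.
    snowflake.execute({sqlText: `INSERT INTO mytable SELECT * FROM my_source_view`});
    return 'Done.';
$$;

CREATE OR REPLACE TASK mytask
    WAREHOUSE = mywh
    SCHEDULE = '180 MINUTE'  -- every 3 hours
AS
    CALL refresh_mytable();

ALTER TASK mytask RESUME;  -- tasks are created suspended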
I have a Snowflake JavaScript stored procedure that takes 2 arguments and performs runtime dynamic aggregation based on the columns selected in the UI.
Adding more context below:
Argument 1: the selected columns (the filter/dropdown attribute or column name sent from the UI)
Argument 2: the dynamic WHERE clause prepared for the dropdown values selected in point 1
Data is fetched from a view and the result is retrieved in this fashion:
CREATE OR REPLACE PROCEDURE database.schema.sp_sample(dynamic_columns VARCHAR, dynamic_where_clause VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
    var rs = "Success";
    try {
        var retrieve_queries_sql = `SELECT COL_1, COL_2, ${DYNAMIC_COLUMNS} FROM view ${DYNAMIC_WHERE_CLAUSE} GROUP BY COL_1, COL_2, ${DYNAMIC_COLUMNS}`;
        var stmt = snowflake.createStatement({sqlText: retrieve_queries_sql, binds: [DYNAMIC_COLUMNS, DYNAMIC_WHERE_CLAUSE, DYNAMIC_COLUMNS]});
        rs = stmt.execute();
    }
    catch (err) {
        rs = "Failed Message: " + err.message;
    }
    return rs;
$$;
CALL "database"."schema"."sp_sample"('COL_3','WHERE COL_3\=\'somevalue\'');
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID(-2)));
Note: there is no requirement at this point to show data specific to a certain user. Also, ignore the argument interpolation part in the code above.
Questions:
1. Is there any concern if this dynamic Snowflake stored procedure is called by the UI application (multiple users concurrently selecting values on the UI, so the procedure is called many times internally), given that it is a read-only operation (a SELECT that just fetches the aggregated values for the dropdown hierarchies/levels the user chooses)?
2. Will there be any question of data integrity, where different users will not see updated values on the UI? For context, the data is fetched from a view inside this procedure, and this view's data will not change while the application is live/active -- all batch loads are completed before users access the application. So the data is ready, and no changes to it are expected while users are using the application.
3. I got a suggestion from someone that the dynamic procedure will cause a data integrity concern, where users will not be able to see updated data, and that pre-aggregated views (data pre-aggregated at certain levels and kept in separate views) would resolve that problem and give better data consistency.
I wanted to understand how this proposal of pre-aggregated views would help if the problem really is data integrity (and why there would be a data integrity concern in the first place). The procedure fetches data from a view (dynamically aggregated), and in the proposed approach we would still be creating many other views (pre-aggregated); either way we end up getting data from views, so how can the latter not cause the same data integrity concern?
Please share feedback on whether these points make sense; I'd like to hear other opinions.
No, there is no concern. LAST_QUERY_ID returns the ID of a query in the current session, so concurrent users in different sessions won't interfere with each other.
https://docs.snowflake.com/en/sql-reference/functions/last_query_id.html
No, there won't be any question of data integrity -- it's no different than executing a complex SQL statement directly. You may check the isolation level:
https://docs.snowflake.com/en/sql-reference/transactions.html#isolation-level
I don't see how it helps with data integrity. In fact, if you use this approach, you may get data integrity issues if you combine the pre-aggregated data with up-to-date data. Maybe it's the same person who thinks Snowflake is a Hadoop variant? :)
I am trying to find out the differences in the way you can define which database to use in SSMS.
Is there any functional difference between selecting the database from the 'Available Databases' drop-down list (e.g. AdventureWorks2008 selected in the dropdown),
the database being defined in the query
SELECT * FROM AdventureWorks2008.dbo.Customers
and
stating the database at the start?
USE AdventureWorks2008
GO
SELECT * FROM dbo.Customers
I'm interested to know if there is a difference in terms of performance or something that happens behind the scenes for each case.
Thank you for your help
Yes, there is. A very small overhead is added when you use "USE AdventureWorks2008", as it is executed against the database every time you run the query. It also prints "Command(s) completed successfully." However, the overhead is so small that if you are OK with this message, you need not care about it.
Yes, there can be a difference.
When you execute a statement like SELECT * FROM AdventureWorks2008.dbo.Customers in the context of another database (not AdventureWorks2008), that other database's settings are applied.
First of all, every database has its own Compatibility Level, which can limit what code you can use; for example, you cannot use the APPLY operator in the context of a database with CL 80, but you can within a database with CL >= 90.
Second, every database has its own set of options such as AUTO_UPDATE_STATISTICS_ASYNC and Forced Parameterization that can affect your query plan.
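A quick way to compare those settings between the databases involved (the database names here are just examples):
SELECT name,
       compatibility_level,
       is_auto_update_stats_async_on,
       is_parameterization_forced
FROM sys.databases
WHERE name IN (N'AdventureWorks2008', N'OtherDb');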
I have encountered cases where the database context influenced the plan:
One case was when I created a filtered index on a table; it was used in the plan when I executed my query in the context of a database with Simple parameterization, but it was not used for the same query executed in the context of a database with Forced parameterization. When I used a hint to force that index, I got an error saying the query plan could not be produced due to the query hint, so I had to investigate. It turned out my query had been parameterized: instead of my condition fld = 0 there was fld = @p, so it could not use my filtered index with its fld = 0 condition.
The second case was regarding table cardinality estimation: we use staging tables to load the data in our ETL procedures and then switch them to the actual tables, like this:
insert into stg with (tablock)
...
truncate table actual;
alter table stg switch to actual;
All the staging tables are empty when the procedure compiles, but within the proc they are filled with data, so when we join them they are not empty anymore. Passing from 0 rows to non-0 rows triggers a statement recompilation that should take the actual number of rows into consideration, but that did not happen on the production server, so all estimations were completely wrong (1 row for every table) and I had to investigate. The cause was AUTO_UPDATE_STATISTICS_ASYNC set to ON in the production database.
Now imagine you have two databases, db1 and db2, with this option set to ON and OFF respectively: in db1 this code will get wrong estimations, while if you execute it in db2 (referencing db1.dbo.stg) it will get the right estimations. The execution time can be very different in these two databases.
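As a sketch of that setup (db1/db2 as in the example above):
ALTER DATABASE db1 SET AUTO_UPDATE_STATISTICS_ASYNC ON;
ALTER DATABASE db2 SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
-- The same statement then behaves differently depending on the session's database context:
-- run in db1:  SELECT ... FROM dbo.stg ...          -- async stats, stale estimates possible
-- run in db2:  SELECT ... FROM db1.dbo.stg ...      -- sync stats, estimates refreshed before compiling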
Background
I have a multi-tenant scenario and a single SQL Server project that will be deployed into multiple database instances on the same server. There will be one DB for each tenant, plus one "model" DB.
The "model" database serves three purposes:
Force some "system" data to be always present in each tenant database
Serves as an access point for users with a special permission to edit system data (which will be punctually synced to all tenants)
When creating a new tenant, the database will be copied and attached with a new name representing the tenant
There are triggers that check whether the modified/deleted data within a tenant DB corresponds to "system" data inside the "model" DB. If it does, an error is raised saying that system data cannot be altered.
Issue
So here's a part of the trigger that checks if deletion can be allowed:
IF DB_NAME() <> 'ModelTenant' AND EXISTS
(
SELECT
[deleted].*
FROM
[deleted]
INNER JOIN [---MODEL DB NAME??? ---].[MySchema].[MyTable] [ModelTable]
ON [deleted].[Guid] = [ModelTable].[Guid]
)
BEGIN;
THROW 50000, 'The DELETE operation on table MyTable cannot be performed. At least one targeted record is reserved by the system and cannot be removed.', 1
END
I can't seem to find what should take the place of --- MODEL DB NAME??? --- in the above code so that the project compiles properly. When referring to a completely different project I know what to do: use a reference to that project, represented with an SQLCMD variable. But in this scenario the referenced project is essentially the same project, only on a different database, and I can't seem to add a self-reference in this manner.
What can I do? Does SSDT offer some kind of support for such a scenario?
Have you tried setting up a Database Variable? You can read under "Reference aware statements" here. You could then say:
SELECT * FROM [$(MyModelDb)].[MySchema].[MyTable] [ModelTable]
If you don't have a specific project for $(MyModelDb) you can choose the option to "suppress errors by unresolved references...". It's been forever since I've used SSDT projects, but I think that should work.
TIP: If you need to reference one table 100 times, you may find it better to create a SYNONYM that uses the database variable, then point to the SYNONYM in your SPROCs/TRIGGERs. Why? Because that way you don't need to redeploy your SPROCs/TRIGGERs to get the variable replaced with the actual value, and that can make development smoother.
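A minimal sketch of that synonym approach (object names are placeholders):
-- Only the synonym depends on the SQLCMD variable; triggers/procs reference [MySchema].[ModelTable].
CREATE SYNONYM [MySchema].[ModelTable]
    FOR [$(MyModelDb)].[MySchema].[MyTable];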
I'm not quite sure if SSDT is particularly well-suited to projects of any decent amount of complexity. I can think of one or two ways to most likely accomplish this (especially depending on exactly how you do the publishing / deployment), but I think you would actually lose more than you gain. What I mean by that is: you could add steps to get this to work (i.e. win the battle), but you would be creating a more complex system in order to get SSDT to publish a system that is more complex (and slower) than it needs to be (i.e. lose the war).
Before worrying about SSDT, let's look at why you need/want SSDT to do this in the first place. You have system data intermixed with tenant data, and you need to validate UPDATE and DELETE operations to ensure that the system data does not get modified, and the only way to identify data that is "system" data is by matching it to a home-of-record -- ModelDB -- based on GUID PKs.
That theory of identifying what data belongs to the "system" and not to a tenant is your main problem, not SSDT. You are definitely on the right track for a multi-tenant system by having the "model" database, but using it for data validation is a poor design choice: on top of the performance degradation already incurred from using GUIDs as PKs, you are further slowing down all of these UPDATE and DELETE operations by funneling them through a single point of contention, since all client DBs need to check this common source.
You would be far better off to include a BIT field in each of these tables that mixes system and tenant data, denoting whether the row was "system" or not. Just look at the system catalog views within SQL Server:
sys.objects has an is_ms_shipped column
sys.assemblies went the other direction and has an is_user_defined column.
So, if you were to add an [IsSystemData] BIT NOT NULL column to these tables, your Trigger logic would become:
IF DB_NAME() <> N'ModelTenant' AND EXISTS
(
SELECT del.*
FROM [deleted] del
WHERE del.[IsSystemData] = 1
)
BEGIN
;THROW 50000, 'The DELETE operation on table MyTable cannot be performed. At least one targeted record is reserved by the system and cannot be removed.', 1;
END;
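For completeness, adding the flag column itself could look something like this (table and constraint names are placeholders; existing rows default to tenant data):
ALTER TABLE [MySchema].[MyTable]
    ADD [IsSystemData] BIT NOT NULL
        CONSTRAINT [DF_MyTable_IsSystemData] DEFAULT (0);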
Benefits:
No more SSDT issue (at least not from this part ;-)
Faster UPDATE and DELETE operations
Less contention on the shared resource (i.e. ModelDB)
Less code complexity
As an alternative to referencing another database project, you can produce a dacpac, then reference the dacpac as a database reference in "same server, different database" mode.
Can they (malicious users) describe tables and get vital information? What about if I lock down the user to specific tables? I'm not saying I want SQL injection, but I wonder about old code we have that is susceptible, where the DB user is locked down. Thank you.
EDIT: I understand what you are saying, but if I have no Response.Write for the other data, how can they see it? Bringing the database to a crawl and DoS make sense, and so do the other points, but how would they actually see the data?
Someone could inject SQL to cause an authorization check to return the equivalent of true instead of false to get access to things that should be off-limits.
Or they could inject a join of a catalog table to itself 20 or 30 times to bring database performance to a crawl.
Or they could call a stored procedure that runs as a different database user that does modify data.
'); SELECT * FROM Users
Yes, you should lock them down to only the data (tables/views) they should actually be able to see, especially if it's publicly facing.
Only if you don't mind arbitrary users reading the entire database. For example, here's a simple, injectable login sequence:
select * from UserTable where userID = 'txtUserName.Text' and password = 'txtPassword.Text'
if(RowCount > 0) {
// Logged in
}
I just have to supply any username and a password of ' or '1'='1 to log in as that user.
Be very careful. I am assuming that you have removed drop table, alter table, create table, and truncate table, right?
Basically, with good SQL Injection, you should be able to change anything that is dependent on the database. This could be authorization, permissions, access to external systems, ...
Do you ever write data to disk that was retrieved from the database? In that case, they could upload an executable like perl and a perl file and then execute them to gain better access to your box.
You can also determine what the data is by leveraging a situation where a specific return value is expected: if the SQL returns true, execution continues; if not, execution stops. Then you can use a binary search in your SQL, e.g. select count(*) where user_password > 'H'; if the count is > 0, execution continues. In this way you can find the exact plain-text password without it ever being printed on the screen.
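To illustrate the probing described above (users and user_password are made-up names; in practice the condition would be injected into an existing query):
SELECT COUNT(*) FROM users WHERE username = 'admin' AND user_password > 'M';  -- count > 0: the password sorts after 'M'
SELECT COUNT(*) FROM users WHERE username = 'admin' AND user_password > 'S';  -- count = 0: it sorts at or before 'S'
-- Halve the range repeatedly (and repeat per character position) to recover the value
-- without it ever being displayed.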
Also, if your application is not hardened against SQL errors, there might be a case where they can inject an error in the SQL or in the SQL of the result and have the result display on the screen during the error handler. The first SQL statement collects a nice list of usernames and passwords. The second statement tries to leverage them in a SQL condition for which they are not appropriate. If the SQL statement is displayed in this error condition, ...
Jacob
I read this question and answers because I was in the process of creating a SQL tutorial website with a readonly user that would allow end users to run any SQL.
Obviously this is risky and I made several mistakes. Here is what I learnt in the first 24 hours (yes most of this is covered by other answers but this information is more actionable).
Do not allow access to your user table or system tables:
Postgres:
REVOKE ALL ON SCHEMA PG_CATALOG, PUBLIC, INFORMATION_SCHEMA FROM PUBLIC
Ensure your read-only user only has access to the tables you need in the schema you want:
Postgres:
GRANT USAGE ON SCHEMA X TO READ_ONLY_USER;
GRANT SELECT ON ALL TABLES IN SCHEMA X TO READ_ONLY_USER
Configure your database to drop long running queries
Postgres:
Set statement_timeout in the PG config file
/etc/postgresql/(version)/main/postgresql.conf
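Alternatively, if the read-only user is a dedicated role, the timeout can be scoped to just that role (the value is an example):
ALTER ROLE READ_ONLY_USER SET statement_timeout = '30s';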
Consider putting the sensitive information inside its own Schema
Postgres:
GRANT USAGE ON SCHEMA MY_SCHEMA TO READ_ONLY_USER;
GRANT SELECT ON ALL TABLES IN SCHEMA MY_SCHEMA TO READ_ONLY_USER;
ALTER USER READ_ONLY_USER SET SEARCH_PATH TO MY_SCHEMA;
Take care to lock down any stored procedures and ensure they cannot be run by the read-only user.
Edit: Note that by completely removing access to the system tables you no longer allow the user to make calls like cast(). So you may want to run this again to allow access:
GRANT USAGE ON SCHEMA PG_CATALOG to READ_ONLY_USER;
Yes, continue to worry about SQL injection. Malicious SQL statements are not just about writes.
Imagine as well if there were linked servers, or the query was written to access cross-database resources, e.g.:
SELECT * from someServer.somePayrollDB.dbo.EmployeeSalary;
There was an Oracle bug that allowed you to crash the instance by calling a public (but undocumented) method with bad parameters.