Snowflake dynamic stored procedure

I have a Snowflake JavaScript stored procedure that takes two arguments and performs runtime dynamic aggregation based on the columns selected in the UI.
Adding more context below:
1st argument: the selected columns (the filter/dropdown attribute or column names sent from the UI)
2nd argument: the dynamic WHERE clause prepared for the corresponding dropdown values selected in point 1
Data is fetched from a view, and the result is retrieved in this fashion:
CREATE OR REPLACE PROCEDURE database.schema.sp_sample(dynamic_columns VARCHAR, dynamic_where_clause VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
var rs = "Success";
try {
    // The arguments arrive as the uppercase globals DYNAMIC_COLUMNS and
    // DYNAMIC_WHERE_CLAUSE; identifiers and WHERE clauses cannot be bound,
    // so they are interpolated directly into the SQL text.
    var retrieve_queries_sql = `SELECT COL_1, COL_2, ${DYNAMIC_COLUMNS} FROM view ${DYNAMIC_WHERE_CLAUSE} GROUP BY COL_1, COL_2, ${DYNAMIC_COLUMNS}`;
    var stmt = snowflake.createStatement({sqlText: retrieve_queries_sql});
    stmt.execute();
}
catch (err) {
    rs = "Failed Message: " + err.message;
}
return rs;
$$;
CALL database.schema.sp_sample('COL_3', 'WHERE COL_3 = ''somevalue''');
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID(-2)));
Note: There is no requirement at this point to show data specific to a certain user. Also, ignore the argument-interpolation (SQL injection) aspect of the code above.
Questions:
Is there any concern if this dynamic Snowflake stored procedure is called by the UI application (multiple users concurrently selecting values on the UI, so the procedure is called many times internally), given that it is a read-only operation (a SELECT that just fetches the aggregated values for the dropdown hierarchies/levels the user chooses)?
Will there be any question of data integrity, where different users do not see updated values on the UI? For context: data is fetched from a view within this procedure, and the view's data will not change while the application is live/active; all batch loads complete before users access the application. So the data is ready, and no changes to it are expected while users are accessing the application.
I got a suggestion from someone that the dynamic procedure will cause a data integrity concern, where users will not be able to see updated data, and that pre-aggregated views (meaning data pre-aggregated at certain levels and kept in different views) will resolve that data integrity problem and give better consistency.
I want to understand how this proposal of pre-aggregated views can help if the problem really is data integrity (why would there be a data integrity concern in the first place?). What the procedure does is fetch data from a view (aggregated dynamically), and in the latter approach we are still creating many other views (pre-aggregated); either way we are getting data from views in both cases. How can the latter not cause a data integrity concern?
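For reference, the pre-aggregation proposal would amount to something like the following (a minimal sketch; the view, source, and measure names are hypothetical):
-- One pre-aggregated view per dropdown level, built after the batch load completes
CREATE OR REPLACE VIEW database.schema.vw_agg_col3 AS
SELECT COL_1, COL_2, COL_3, SUM(measure_col) AS total_measure
FROM database.schema.source_view
GROUP BY COL_1, COL_2, COL_3;

-- The UI would then issue a plain filtered SELECT instead of calling the procedure
SELECT * FROM database.schema.vw_agg_col3 WHERE COL_3 = 'somevalue';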
Please share some feedback on whether these points make sense. I would like to hear better opinions.

No, there is no concern. LAST_QUERY_ID returns the ID of a specified query in the current session; here LAST_QUERY_ID(-2) works because, once the CALL completes, -1 refers to the CALL statement itself and -2 to the SELECT executed inside the procedure.
https://docs.snowflake.com/en/sql-reference/functions/last_query_id.html
No, there won't be any question of data integrity - it's no different than executing a complex SQL statement directly. You may check the isolation level:
https://docs.snowflake.com/en/sql-reference/transactions.html#isolation-level
I don't understand how that helps with data integrity. In fact, if you use this approach, you may introduce data integrity issues if you combine the pre-aggregated data with up-to-date data. Maybe they are the same person who thinks Snowflake is a Hadoop variant? :)

Related

In this situation, is it best practice to create an ORACLE VIEW or TABLE?

In order to report on Oracle logons, I have a query that finds logons to the system. I want to output the query results to a table or view in order to then report on that table/view. The underlying tables that the query is based on do not keep historical data, hence my need for a new table/view.
The query will run 3 times a day to gather logon information at those times.
Is it best practice to create a new table and append the daily information, or would a view be better? I'm unsure about the updating of a view, as the underlying tables need to be present, if that's correct.
Thanks
A view won't help at all. It is just a stored query and doesn't contain any data; it just reflects what you have in the underlying table(s).
Therefore, you'll need a "history" table which will permanently hold the data you're interested in.
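A minimal sketch of that approach (table and column names are hypothetical, and V$SESSION stands in for whatever the actual logon query reads; scheduling, e.g. via DBMS_SCHEDULER, is assumed):
-- Permanent history table populated from the logon query
CREATE TABLE logon_history (
    username     VARCHAR2(128),
    logon_time   DATE,
    captured_at  TIMESTAMP DEFAULT SYSTIMESTAMP
);

-- Run 3 times a day, e.g. from a DBMS_SCHEDULER job
INSERT INTO logon_history (username, logon_time)
SELECT username, logon_time
FROM v$session            -- stand-in for the actual logon query
WHERE username IS NOT NULL;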

Stored procedure to update different columns

I have an API I'm trying to read that gives me just the updated fields. I'm trying to take that and update my tables using a stored procedure. So far the only way I've figured out how to do this is with dynamic SQL, but I would prefer not to do that if there is another way.
If it were just a couple of columns, I'd just write a proc for each, but we are talking about 100 fields and any of them could be updated together. One ticket might just need a timestamp updated, while the next might need a timestamp and who modified it, and the next might just need a note.
Everything I've read and been taught says that dynamic SQL is bad, and while I'll write it if I have to, I'd prefer a proc.
You can perhaps do something like this:
IF EXISTS (SELECT * FROM NewTable EXCEPT SELECT * FROM OldTable)
BEGIN
    UPDATE o
    SET o.OldRecords = n.NewRecords
    FROM OldTable o
    JOIN NewTable n ON o.PrimaryKey = n.PrimaryKey
END
The best way to solve your problem is using MERGE:
Performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table.
As you can see, your update could be more complex, but more efficient as well. Using MERGE requires some proficiency, but once you start using it you'll use it with pleasure again and again.
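A minimal sketch of what that MERGE could look like here (table and column names are hypothetical, reusing those from the snippet above):
MERGE OldTable AS target
USING NewTable AS source
    ON target.PrimaryKey = source.PrimaryKey
WHEN MATCHED THEN
    UPDATE SET target.OldRecords = source.NewRecords
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PrimaryKey, OldRecords)
    VALUES (source.PrimaryKey, source.NewRecords);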
I am not sure how your business logic works that determines what columns are updated at what time. If there are separate business functions that require updating different but consistent columns per function, you will probably want to have individual update statements for each function. This will ensure that each process updates only the columns that it needs to update.
On the other hand, if your API is such that you really don't know ahead of time what needs to be updated, then building a dynamic SQL query is a good idea.
Another option is to build a save proc that sets every user-configurable field. As long as the calling process has all of that data, it can call the save procedure and pass every updateable column. There is no harm in an UPDATE MyTable SET MyCol = @MyCol where the value is the same on each side.
Note that even if all of the values are the same, the rowversion (or timestamp) columns will still be updated, if present.
With our software, the tables that users can edit have a widely varying range of columns. We chose to create a single save procedure for each table that has all of the update-able columns as parameters. The calling processes (our web servers) have all the required columns in memory. They pass all of the columns on every call. This performs fine for our purposes.
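As a sketch, such a save procedure might look like this (a hypothetical ticket table, showing only three of the ~100 updateable columns):
CREATE PROCEDURE dbo.SaveTicket
    @TicketID   INT,
    @ModifiedAt DATETIME2,
    @ModifiedBy NVARCHAR(100),
    @Note       NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;
    -- Every updateable column is set on every call,
    -- even when the incoming value equals the stored one.
    UPDATE dbo.Ticket
    SET ModifiedAt = @ModifiedAt,
        ModifiedBy = @ModifiedBy,
        Note       = @Note
    WHERE TicketID = @TicketID;
END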

SQL Views vs. MS Access queries - Updating data affects multiple base tables

I'm interested in understanding more about using a SQL View vs. a local query in MS Access. I like the fact that a view is basically a query that is stored on the server, and local machines running Access "see" it as a table.
For performance reasons, I'll sometimes take a view over a query since it typically makes a form load a lot faster. However, I've run into issues where I can't update the view if I make changes to two fields that belong to different base tables, even when the view is constructed correctly with the proper joins, etc.
Just wondering if there is a more efficient and proper way to construct a query that can be updated.
A user can never update more than one table at a time. That's a given. You need to construct your form (probably using subforms) to represent the data using either single-table views, simple views that are updatable, or tables.
Subforms are basically left joins to the parent form, like:
SELECT *
FROM ParentForm P
LEFT JOIN SubForm S
    ON P.ParentID    <~~ Link Master Field
     = S.ParentID    <~~ Link Child Field
So you can recreate your view using subforms.
If your view is too complicated to fit this mold, it is probably not updatable, and it probably means that the data you DO want to update lives in a single table while the rest of the information in your view is supporting information, i.e. displayed to help the user make a decision.
In this case you should make the Record Source of your form the updatable table/view that you want to update. Then, for the comboboxes/listboxes/controls which supply the data going into your updatable table/view, you make the Row Source your complicated view.
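As a sketch of that split (hypothetical object names; the SQL goes into the form's properties):
-- Form Record Source: the single updatable table/view
SELECT OrderID, CustomerID, OrderDate FROM Orders;
-- Combo box Row Source: the complicated read-only view supporting the choice
SELECT CustomerID, CompanyName FROM CustomerSummaryView;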
No matter where the view is (in Access or on the server), if it is constructed in such a way that it is impossible to determine which record in which table should be changed, nothing else matters. You need to design the whole form differently.

Entity Framework Algorithm For Combining Data

This pertains to a project I am inheriting and cannot change table structure or data access model. I have been asked to optimize the algorithm being used to insert data into the database.
We have a dataset in table T. From that, we pull a set we will call A. We also query an XML feed and get a set we will call X.
If a value from X is in A, record in A must be updated to reflect data for X.record
If a value from X is not in A, X.record must be inserted into A
If a value from A is not in X, A.record must be preserved in A
X must be fully iterated through for all records, and A must be updated
All these changes need to be inserted back into the database.
The algorithm as set up does the following:
// Query the XML into a list
foreach (var item in xmlList)
{
    // look up item in A via LINQ
    var query = from record in A
                where record.GUID == item.GUID
                select record;
    if (query.Count() == 0)
    {
        // insert into A (via context.AddToTableName(newTableNameObject))
    }
    else
    {
        var currentRecord = query.First();
        // set all properties on currentRecord = properties from item
    }
    context.SaveChanges();
}
I know this is suboptimal. I tried to pull A into an object (call it queryA) outside of the foreach loop, in an effort to move the query to memory instead of hitting the disk, but after thinking it through I realized the database was already in memory.
Having added timer objects to the algorithm, it's clear that what costs the most time is the SaveChanges() call. In some cases it's 20 ms, and in others, inexplicably, it jumps to 100 ms.
I would prefer to call SaveChanges() only once. I can't figure out how to do that given my depth of knowledge of EF (which is thin at best) and the constraints of not being able to change the table structures and having to preserve the data in A that is not in X.
Suggestions?
I don't think you will improve the performance much while using Entity Framework:
The query
Loading each record with a separate query is not good.
You can improve the performance by loading multiple records in the same query. For example, you can load a small batch of records by using either || in the condition or Contains (like IN in SQL). Contains is only supported on .NET 4.0.
Another improvement is replacing the query with a stored procedure and a table-valued parameter: pass all the GUIDs to SQL Server, join A with X's GUIDs, and get the results in one round trip. Table-valued parameters are only supported on SQL 2008 and newer.
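A minimal sketch of that server-side piece (type, table, and column names are hypothetical):
-- Table type holding the GUIDs from X
CREATE TYPE dbo.GuidList AS TABLE (GUID UNIQUEIDENTIFIER PRIMARY KEY);
GO
-- Returns the existing A records matching the passed GUIDs in one round trip
CREATE PROCEDURE dbo.GetMatchingRecords
    @Guids dbo.GuidList READONLY
AS
BEGIN
    SET NOCOUNT ON;
    SELECT a.*
    FROM dbo.A a
    JOIN @Guids g ON a.GUID = g.GUID;
END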
The data modification
You should not call SaveChanges after each modification. You can call it once, after the foreach loop, and it will still work: it will pass all modifications in a single transaction and, according to this answer, it can give you a significant boost.
EF doesn't support command batching, and because of that each update or insert always takes a separate round trip to the database. There is no way around this when using EF to modify data, short of implementing a whole new EF ADO.NET provider (which is like starting a new project).
Again, the solution is to reduce round trips by using a stored procedure with a table-valued parameter.
If your DB also uses that GUID as the primary key and clustered index, you get another performance decrease from reordering the index after each insert (= modifying data on disk).
The problem is not the algorithm but the way you process the data and the technology used to process it. Entity Framework is not a good choice for data pumps. You should take this information to your boss, because improving performance means a more complicated change to your application. It is not your fault, and it is not the fault of the programmer who wrote the application. This is a characteristic of EF that is not very well known and, as far as I know, is not clearly stated in any MS best practices.

Creating an index on a view with OpenQuery

SQL Server doesn't allow creating a view WITH SCHEMABINDING where the view query uses OpenQuery, as sketched below.
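For illustration, the kind of view that fails (the ADSI linked-server name and the LDAP query are hypothetical):
CREATE VIEW dbo.vw_AdUsers WITH SCHEMABINDING AS
SELECT cn, sAMAccountName
FROM OPENQUERY(ADSI,
    'SELECT cn, sAMAccountName FROM ''LDAP://DC=example,DC=com'' WHERE objectClass = ''user''');
-- Fails: a schema-bound view cannot reference a rowset function such as OPENQUERY.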
Is there a way or a work-around to create an index on such a view?
The best you could do would be to schedule a periodic export of the AD data you are interested in to a table.
The table could of course then have all the indexes you like. If you ran the export every 10 minutes and the possibility of getting data that is 9 minutes and 59 seconds out of date is not a problem, then your queries will be lightning fast.
The only part of concern would be managing locking and concurrency during the export. One strategy might be to export the data into a new table and then swap it into place through renames. Another might be to use SYNONYMs (SQL 2005 and up) to do something similar, where you just point the synonym at two alternating tables.
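A sketch of the synonym strategy (object names are hypothetical):
-- Two alternating tables plus a synonym that queries actually reference
CREATE TABLE dbo.AdUsers_A (cn NVARCHAR(256), sAMAccountName NVARCHAR(256));
CREATE TABLE dbo.AdUsers_B (cn NVARCHAR(256), sAMAccountName NVARCHAR(256));
CREATE SYNONYM dbo.AdUsers FOR dbo.AdUsers_A;

-- Each export run: reload the idle table, then repoint the synonym
TRUNCATE TABLE dbo.AdUsers_B;
INSERT INTO dbo.AdUsers_B (cn, sAMAccountName)
SELECT cn, sAMAccountName FROM dbo.AdExportStaging;  -- stand-in for the export source
DROP SYNONYM dbo.AdUsers;
CREATE SYNONYM dbo.AdUsers FOR dbo.AdUsers_B;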
The data that supplies the query you're performing comes from a completely different system outside of SQL Server. There's no way SQL Server can create an indexed view on data it does not own. For starters, how would it be notified when something had changed so it could update its indexes? There would have to be some notification and update mechanism, which is implausible because SQL Server could not reasonably maintain ACID guarantees across such a distributed, slow, non-SQL Server transaction to an outside system.
Thus my suggestion for mimicking such a thing through your own scheduled jobs that refresh the data every X minutes.
--Responding to your comment--
You can't tell whether a new user has been added without querying. If Active Directory supports some API that generates events, I've never heard of it.
But, each time you query, you could store the greatest creation time of all the users in a table, then through dynamic SQL, query only for new users with a creation date after that. This query should theoretically be very fast as it would pull very little data across the wire. You would just have to look into what the exact AD field would be for the creation date of the user and the syntax for conditions on that field.
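A rough sketch of that incremental pull (a hypothetical ADSI linked server and state table; the whenCreated filter syntax is an assumption to verify against the AD documentation):
DECLARE @lastCreated VARCHAR(20);
SELECT @lastCreated = MaxWhenCreated FROM dbo.AdSyncState;  -- e.g. '20240101000000.0Z'

-- OPENQUERY needs a literal string, hence the dynamic SQL
DECLARE @ad NVARCHAR(MAX) =
      N'SELECT cn, sAMAccountName, whenCreated FROM ''LDAP://DC=example,DC=com'' '
    + N'WHERE objectClass = ''user'' AND whenCreated >= ''' + @lastCreated + N'''';
DECLARE @sql NVARCHAR(MAX) =
    N'SELECT * FROM OPENQUERY(ADSI, ''' + REPLACE(@ad, N'''', N'''''') + N''')';
EXEC sys.sp_executesql @sql;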
If managing the dynamic SQL were too tough, a very simple VBScript, VB, or .NET application could also query Active Directory for you on a schedule and update the database.
Here are the basics for indexed views and their requirements. Note that what you are trying to do would probably fall into the category of a derived table, so it is not possible to create an indexed view using "OpenQuery".
This list is from http://www.sqlteam.com/article/indexed-views-in-sql-server-2000
1. View definition must always return the same results from the same underlying data.
2. Views cannot use non-deterministic functions.
3. The first index on a view must be a clustered, UNIQUE index.
4. If you use GROUP BY, you must include COUNT_BIG(*) in the select list.
5. View definition cannot contain the following:
a. TOP
b. text, ntext or image columns
c. DISTINCT
d. MIN, MAX, COUNT, STDEV, VARIANCE, AVG
e. SUM on a nullable expression
f. A derived table
g. Rowset functions
h. Another view
i. UNION
j. Subqueries, outer joins, self joins
k. Full-text predicates like CONTAINS or FREETEXT
l. COMPUTE or COMPUTE BY
m. ORDER BY in the view definition
In this case, there is no way for SQL Server to know of any changes (data, schema, whatever) in the remote data source. For a local table, it can use SCHEMABINDING etc. to ensure the underlying table(s) stay the same, and it can track data changes.
If you need to query the view often, then I'd use a local table that is refreshed periodically. In fact, I'd use a table anyway. AD queries aren't the quickest at the best of times...
