How to dynamically add fields to a select statement in Snowflake - snowflake-cloud-data-platform

I have a large table of data in Snowflake that contains many fields with a name prefix of 'RAW_'. In order to make my table more manageable, I wish to exclude all of these 'RAW_' fields completely from my select statement.
To do this manually is problematic, as I normally have anywhere from 90-120 fields I would otherwise need to sift through and place in the select statement. I therefore need a solution that dynamically adds the fields that don't have the 'RAW_' prefix to my main select statement.
Here is a snippet of test data ...
insert into reviews values
('abc', 4, NULL, 1, 'NO'),
('xyz', 3, 'some', 1, 'haha'),
('lmn', 1, 'what', 4, NULL);
and this is my 'manual' solution (noting that I would normally be adding 90-120 non-'RAW_' fields by hand):
select
    name,
    acting_rating
from reviews;
getting me this desired output...
If I had the ability to add these non-'RAW_' fields to my select statement dynamically (noting that I usually need to add 90-120 at a time), that would be great.

It's not possible to exclude columns by prefix (yet). Snowflake supports EXCLUDE, but you need to list the exact column names.
https://docs.snowflake.com/en/sql-reference/sql/select.html#parameters
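For example, EXCLUDE looks like this with explicitly listed names (RAW_SOURCE and RAW_PAYLOAD are hypothetical examples of the prefixed columns):
select * exclude (RAW_SOURCE, RAW_PAYLOAD)
from reviews;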
You can create views, so you don't need to write all columns every time you use a select.
You can create a stored procedure to generate the select statements for any table.
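As a rough sketch of that last idea (the reviews table comes from the question; the schema filter is an assumption you may need to adjust), you can build the column list from the information schema and either copy the generated statement or feed it to EXECUTE IMMEDIATE:
-- Generate "select <non-RAW_ columns> from reviews;" dynamically.
select 'select '
       || listagg(column_name, ', ') within group (order by ordinal_position)
       || ' from reviews;' as generated_sql
from information_schema.columns
where table_name = 'REVIEWS'
  and table_schema = current_schema()
  and not startswith(column_name, 'RAW_');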

Related

Need to Add Values to Certain Items

I have a table where I need to add the same values to a whole bunch of items (in a nutshell, if the item doesn't have a UNIT of "CTN" I want to add the same values I have listed to them all).
I thought the following would work but it doesn't :(
Any idea what I am doing wrong?
INSERT INTO ICUNIT
(UNIT,AUDTDATE,AUDTTIME,AUDTUSER,AUDTORG,CONVERSION)
VALUES ('CTN','20220509','22513927','ADMIN','AU','1')
WHERE ITEMNO In '0','etc','etc','etc'
If I understand correctly, you might want to use INSERT INTO ... SELECT from the original table with your condition.
INSERT INTO ICUNIT (UNIT,AUDTDATE,AUDTTIME,AUDTUSER,AUDTORG,CONVERSION)
SELECT 'CTN','20220509','22513927','ADMIN','AU','1'
FROM ICUNIT
WHERE ITEMNO In ('0','etc','etc','etc')
The query you need starts by selecting the filtered items. So it seems something like the below is your starting point:
select <?> from dbo.ICUNIT as icu where icu.UNIT <> 'CTN' order by ...;
Notice the use of schema name, terminators, and table aliases - all best practices. I will guess that a given "item" can have multiple rows in this table so long as UNIT is unique within ITEMNO. Correct? If so, the above query won't work, so let's try slightly more complicated filtering.
select distinct icu.ITEMNO
from dbo.ICUNIT as icu
where not exists (select *
                  from dbo.ICUNIT as ctns
                  where ctns.ITEMNO = icu.ITEMNO -- correlating the subquery
                    and ctns.UNIT = 'CTN')
order by ...;
There are other ways to do that above but that is one common way. That query will produce a resultset of all ITEMNO values in your table that do not already have a row where UNIT is "CTN". If you need to filter that for specific ITEMNO values you simply adjust the WHERE clause. If that works correctly, you can use that with your insert statement to then insert the desired rows.
insert into dbo.ICUNIT (...)
select distinct icu.ITEMNO, 'CTN', '20220509', '22513927', 'ADMIN', 'AU', '1'
from ...
;
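Putting both pieces together, a sketch of the complete statement could look like this (the column list is inferred from the question's INSERT, so verify it against the real table):
insert into dbo.ICUNIT
    (ITEMNO, UNIT, AUDTDATE, AUDTTIME, AUDTUSER, AUDTORG, CONVERSION)
select distinct icu.ITEMNO, 'CTN', '20220509', '22513927', 'ADMIN', 'AU', '1'
from dbo.ICUNIT as icu
where not exists (select *
                  from dbo.ICUNIT as ctns
                  where ctns.ITEMNO = icu.ITEMNO
                    and ctns.UNIT = 'CTN');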

Snowflake INSERT inconsistent behavior using ORDER BY

I'm trying to insert records into a table in a certain (and simple) order, as the table has an IDENTITY column (e.g. MyTbl (ID INT IDENTITY(1,1), Sale_Date DATE, Product_ID INT, Sales INT)).
The query is quite simple (this is just a simplified example):
INSERT INTO MyTbl (Sale_Date, Product_ID, Sales)
SELECT Sale_Date, Product_ID,COUNT(*) as sales
FROM Fact_tbl
GROUP BY Sale_Date,Product_ID
ORDER BY Sale_Date,Product_ID
The expected behavior is that when I select the highest values of the identity ID column, I should see the latest Sale_Date. However, this is not the case. The order of the ID column in the table has nothing to do with the dates. To make things even worse, if I recreate the table and run the same INSERT statement again and again and again, I'm getting different order of insertion each time for the same data.
I'm getting this behavior even if I wrap the query in a subquery and put the ORDER BY inside or outside of the wrapper.
I never saw this behavior in any other SQL platform. Is this the expected behavior in Snowflake?
It's expected. Let me explain the reason:
AUTOINCREMENT and IDENTITY are synonymous. If either is specified for a column, Snowflake utilizes a sequence to generate the values for the column.
https://docs.snowflake.com/en/sql-reference/sql/create-table.html#optional-parameters
There is no guarantee that values from a sequence are contiguous (gap-free) or that the sequence values are assigned in a particular order. There is, in fact, no way to assign values from a sequence to rows in a specified order other than to use single-row statements (this still provides no guarantee about gaps).
https://docs.snowflake.com/en/user-guide/querying-sequences.html#sequence-semantics
"With Snowflake each INSERT has a completely different order than the same INSERT that ran a couple of minutes ago."
No, it should insert the data in the expected order because you use the ORDER BY clause. The issue is that the sequence values are not assigned in a particular order!
It's not easy to verify whether the data is sorted when you use "INSERT ... SELECT ... ORDER BY" unless you have access to the underlying metadata. For testing, you may define clustering keys on a table into which you ingested the "sorted" data.
Anyway, if you want to assign IDs matching the order when inserting bulk data, you need to use ROW_NUMBER instead of using an IDENTITY column or any sequence values.
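A minimal sketch of that approach, using the tables from the question and assuming the ID column accepts explicitly supplied values (Snowflake lets you override an autoincrement default by providing a value):
-- Assign IDs from ROW_NUMBER so they follow the intended sort order,
-- instead of relying on the IDENTITY/sequence default.
insert into MyTbl (ID, Sale_Date, Product_ID, Sales)
select row_number() over (order by Sale_Date, Product_ID) as ID,
       Sale_Date,
       Product_ID,
       count(*) as Sales
from Fact_tbl
group by Sale_Date, Product_ID;
For later batches you would offset ROW_NUMBER by the current MAX(ID) to keep the values unique.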
This is not expected behavior in Snowflake. However, the way you insert data into your table (with the ORDER BY) doesn't affect the order in which the data is stored inside the table. You can leave the ORDER BY out of the insert, but you should include it in your select.

stored procedure can't figure out how to insert results into existing table

I have seen many posts on this but unfortunately I have not been able to use them to my benefit. Like many others, I would like to take the results of a stored procedure and, as part of the procedure, have the results go into an already existing table every time it is run. I cannot figure out how to do this, nor how to match the columns up.
For example, the existing table has 4 columns. 2 of those columns are strings, and the stored procedure needs to hard-code those values (a fixed name and definition for each procedure). I have 4 stored procedures and each one needs to have the 'name' and 'description' defined in it. This accounts for 2 of the 4 columns in the existing 'output' table. The remaining two columns need to show, respectively, the datetime the procedure was run and whether the check passed or failed.
My procedure works fine, up to outputting results.
The existing output table, let's call it 'dbo.procedure_results', has 4 columns (checkname, description, asofdate, and pass).
There are 4 procedures, let's call them dbo.p1, dbo.p2, ...
Each procedure, when run, currently outputs a file date and a total allocated storage integer (in TB). I need to define 'checkname' and 'description' for each procedure. 'asofdate' can be the datetime the procedure was run. 'pass' is an if statement: if the difference between returned values 1 and 3 is X, then Y, else Z.
My current procedure:
ALTER PROCEDURE [dbo].[SP_DQ_SAN_Hosts_200TB_TotalCapacity]
AS
BEGIN
    select top 3 (CONVERT(date, day)) as Ref_Date, sum(TotalAllocated_TB)
    FROM [STORAGE_DW].[dbo].[SMART_Host_Dash]
    where TotalAllocated_TB > 50
    GROUP BY cast(Day as date)
    order by CONVERT(date, day) desc

    insert into dbo.dq_check_summary
    exec dbo.sp_dq_san_hosts_200tb_totalcapacity
END
The procedure currently outputs:
Ref_Date     Allocated_TB
2017-05-29   898.868404388428
2017-04-24   1056.01000595093
2017-03-27   779.119682312012
Your goal is suspicious - much like the idea that the condition "TotalAllocated_TB > 50" corresponds to the name you gave to your procedure. But you simply change the select statement to include the literals you want. E.g.,
select top 3
    'what you want for checkname'   as 'check name',
    'what you want for description' as 'description',
    cast([day] as date)             as refdate,
    ...
order by refdate;
Note that you can use an alias in the ORDER BY clause; you don't need to repeat the expression. Notice that you are summing and also filtering the selected rows. Are you certain this logic is correct? The filter is applied logically before the summing. Be careful that you understand that, and that your sample data is sufficient to validate your logic. And do not write an insert statement that does not specify the columns to be inserted. Ever!
insert into dbo.dq_check_summary (col1, col2, col3, col4)
execute dbo.blah;
Since your procedure is relatively trivial, you might consider using table-valued functions rather than a procedure. That would require a different approach but might be easier (in the long run) to use generally. In fact, you could create a view to do this - which might be a better choice for someone without much T-SQL experience. A view can generate the resultset you desire; you then decide whether you want to insert the results into a table when you query the view.
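For illustration only, here is a sketch of one of the procedures rewritten along those lines. The checkname/description literals are placeholders, the pass column is assumed to be text, and the pass/fail comparison is a placeholder because the real X/Y/Z rule isn't given in the question:
ALTER PROCEDURE dbo.SP_DQ_SAN_Hosts_200TB_TotalCapacity
AS
BEGIN
    WITH recent AS (
        -- the question's top-3 aggregation, numbered newest-first
        SELECT TOP (3)
               CAST([day] AS date)    AS Ref_Date,
               SUM(TotalAllocated_TB) AS Allocated_TB,
               ROW_NUMBER() OVER (ORDER BY CAST([day] AS date) DESC) AS rn
        FROM STORAGE_DW.dbo.SMART_Host_Dash
        WHERE TotalAllocated_TB > 50
        GROUP BY CAST([day] AS date)
        ORDER BY Ref_Date DESC
    )
    INSERT INTO dbo.dq_check_summary (checkname, description, asofdate, pass)
    SELECT 'SAN_Hosts_200TB_TotalCapacity',             -- placeholder checkname
           'Total allocated TB over the last 3 loads',  -- placeholder description
           SYSDATETIME(),
           CASE WHEN x.newest - x.oldest >= 0           -- placeholder: substitute the real X/Y/Z rule
                THEN 'PASS' ELSE 'FAIL' END
    FROM (SELECT MAX(CASE WHEN rn = 1 THEN Allocated_TB END) AS newest,
                 MAX(CASE WHEN rn = 3 THEN Allocated_TB END) AS oldest
          FROM recent) AS x;
END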

Convert Date Stored as VARCHAR into INT to compare to Date Stored as INT

I'm using SQL Server 2014. My request I believe is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The data value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison since one will still remain VARCHAR and the other is INT. If possible through a T-SQL query to convert on the fly that would be great advice as well. :)
REPLACE the string and then CONVERT to integer
SELECT A.*, B.*
FROM TableA A
INNER JOIN
    (SELECT intField
     FROM TableB) as B
    ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed (see the sketch after this list).
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a stored procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. But making a datatype change (or even moving a field to a different table, or even replacing a field with a simple CASE statement) "behind the scenes", and masking it so that any code outside of your control doesn't know a change happened, is not nearly as difficult as most people might think. I have done all of these operations (datatype change, move a field to a different table, replace a field with simple logic, etc.) and it buys you a lot of time until the app code can be updated. That might be another team who handles that. Maybe their schedule won't allow for making any changes in that area (plus testing) for 3 months. Ok. It will be there waiting for them when they are ready. And if there are several areas to update, they can be done one at a time. You can even create new stored procedures, to run in parallel, that take the proper INT datatype as the input parameter for any updated app code. And once all references to the VARCHAR value are gone, you can delete the original versions of those stored procedures.
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
                  from t2
                  where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
                 );
This should be close enough to EXCEPT for your purposes.
I should add that you might need to include other columns in the where statement. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a calculated column where the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMABINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
       --, other columns including the PK, etc
FROM dbo.TablewithCharColumn;

CREATE UNIQUE CLUSTERED INDEX IX_PersistedView
    ON dbo.PersistedView(<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better on the SELECT statement where you join the two columns compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and will prevent you from making DDL changes to the underlying tables.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
    PCT_ID INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_PCT
        PRIMARY KEY CLUSTERED
    , SomeChar VARCHAR(50) NOT NULL
    , SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);

CREATE INDEX IX_PCT_SomeCharToInt
    ON dbo.PCT(SomeCharToInt);

INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');

SELECT SomeCharToInt
FROM dbo.PCT;
Results:
SomeCharToInt
201508

joining latest of various usermetadata tags to user rows

I have a Postgres database with a user table (userid, firstname, lastname) and a usermetadata table (userid, code, content, created datetime). I store various information about each user in the usermetadata table by code and keep a full history. So, for example, a user (userid 15) has the following metadata:
15, 'QHS', '20', '2008-08-24 13:36:33.465567-04'
15, 'QHE', '8', '2008-08-24 12:07:08.660519-04'
15, 'QHS', '21', '2008-08-24 09:44:44.39354-04'
15, 'QHE', '10', '2008-08-24 08:47:57.672058-04'
I need to fetch a list of all my users and the most recent value of each of various usermetadata codes. I did this programmatically and it was, of course, godawful slow. The best I could figure out in SQL was to join sub-selects, which were also slow, and I had to do one for each code.
This is actually not that hard to do in PostgreSQL because it has the "DISTINCT ON" clause in its SELECT syntax (DISTINCT ON isn't standard SQL).
SELECT DISTINCT ON (code) code, content, createtime
FROM metatable
WHERE userid = 15
ORDER BY code, createtime DESC;
That will limit the returned results to the first result per unique code, and if you sort the results by the create time descending, you'll get the newest of each.
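Since the question asks for every user, here is a sketch extending the same idea across the whole user table. The users table name and the join are assumptions (the question only describes the columns); the metadata column names follow the query above:
SELECT DISTINCT ON (m.userid, m.code)
       u.userid, u.firstname, u.lastname, m.code, m.content, m.createtime
FROM users u
JOIN metatable m ON m.userid = u.userid
ORDER BY m.userid, m.code, m.createtime DESC;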
I suppose you're not willing to modify your schema, so I'm afraid my answer might not be of much help, but here goes...
One possible solution would be to leave the time field empty until the value is replaced by a newer one, at which point you insert the 'deprecation date'. Another way is to expand the table with an 'active' column, but that would introduce some redundancy.
The classic solution would be to have both 'Valid-From' and 'Valid-To' fields where the 'Valid-To' fields are blank until some other entry becomes valid. This can be handled easily by using triggers or similar. Using constraints to make sure there is only one item of each type that is valid will ensure data integrity.
Common to these is that there is a single way of determining the set of current fields. You'd simply select all entries with the active user and a NULL 'Valid-To' or 'deprecation date' or a true 'active'.
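With the 'Valid-To' variant, for instance, fetching the current values is just a NULL check (the table and column names here are assumed, not from the question):
SELECT userid, code, content
FROM usermetadata
WHERE valid_to IS NULL;   -- only rows that have not been superseded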
You might be interested in taking a look at the Wikipedia entry on temporal databases and the article A consensus glossary of temporal database concepts.
A subselect is the standard way of doing this sort of thing. You just need a Unique Constraint on UserId, Code, and Date - and then you can run the following:
SELECT *
FROM Table
JOIN (
    SELECT UserId, Code, MAX(Date) as LastDate
    FROM Table
    GROUP BY UserId, Code
) as Latest
    ON Table.UserId = Latest.UserId
    AND Table.Code = Latest.Code
    AND Table.Date = Latest.LastDate
WHERE UserId = #userId
