Valid partition columns for external tables - snowflake-cloud-data-platform

I am trying to create an external table with various partition columns.
It works to do the following, for instance:
create or replace external table mytable(
    myday date as to_date(substr(metadata$filename, 35, 10), 'YYYY-MM-DD'))
partition by (myday)
location = @mys3stage
file_format = (type = parquet);
However, I would like to use regexp_substr instead of character indexing, as I won't always have consistent character indices for all partitioning columns. I would like to do this:
create or replace external table mytable(
    myday date as to_date(regexp_substr(metadata$filename, 'day=[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'), 'day=YYYY-MM-DD'))
partition by (myday)
location = @mys3stage
file_format = (type = parquet);
This gives me the error "Defining expression for partition column MYDAY is invalid." I can run the regexp_substr expression successfully in a SELECT statement outside of the external table creation, getting the same results as the substr approach.
How can I use regex string matching in my external table partition column definition?

REGEXP_SUBSTR is not a function that is on the list of currently supported partition key functions. Please use the following link to see the list of acceptable functions:
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html#partitioning-parameters
I am not completely sure I understand how your folder structure wouldn't be consistent. Perhaps if you provided an example, this community could offer a more precise response. However, if you are unable to come up with a single parsing mechanism that works, you could leverage a CASE expression to handle each unique folder structure you come across in your environment, as sketched below.
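For instance, here is a minimal sketch of that approach. The folder layouts and character offsets below are hypothetical, and both CASE branches stick to SUBSTR, which is on the supported list linked above:
create or replace external table mytable(
    myday date as to_date(
        case
            -- hypothetical layout A: appA/day=YYYY-MM-DD/...
            when metadata$filename like 'appA/%' then substr(metadata$filename, 10, 10)
            -- hypothetical layout B: appB/v2/day=YYYY-MM-DD/...
            else substr(metadata$filename, 13, 10)
        end,
        'YYYY-MM-DD'))
partition by (myday)
location = @mys3stage
file_format = (type = parquet);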

Related

Snowflake CHANGES | Why does it need to perform a self join? Why is it slower than join using other unique column?

I was facing issues with a merge statement over large tables.
The source table for the merge is basically a clone of the target table after applying some DML.
e.g. In the below example PUBLIC.customer is target and STAGING.customer is the source.
CREATE OR REPLACE TABLE STAGING.customer CLONE PUBLIC.customer;
MERGE INTO STAGING.customer TARGET
USING (SELECT * FROM NEW_CUSTOMER) AS SOURCE
    ON TARGET.ID = SOURCE.ID
WHEN MATCHED AND SOURCE.DELETEFLAG = TRUE THEN DELETE
WHEN MATCHED AND TARGET.ROWMODIFIED < SOURCE.ROWMODIFIED THEN UPDATE SET TARGET.AGE = SOURCE.AGE, ...
WHEN NOT MATCHED THEN INSERT (AGE, DELETEFLAG, ID, ...) VALUES (SOURCE.AGE, SOURCE.DELETEFLAG, SOURCE.ID, ...);
Currently, we are simply merging the STAGING.customer back to PUBLIC.customer at the end.
This final merge statement is very costly for some of the large tables.
While looking for a solution to reduce the cost, I discovered Snowflake "CHANGES" mechanism. As per the documentation,
Currently, at least one of the following must be true before change tracking metadata is recorded for a table:
Change tracking is enabled on the table (using ALTER TABLE … CHANGE_TRACKING = TRUE).
A stream is created for the table (using CREATE STREAM).
Both options add hidden columns to the table which store change tracking metadata. The columns consume a small amount of storage.
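For reference, enabling the first option is a single statement; this is what it looks like for the table in my example:
-- enable change tracking metadata on the target table
ALTER TABLE PUBLIC.CUSTOMER SET CHANGE_TRACKING = TRUE;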
I assumed that the metadata added to the table is equivalent to the result-set of the select statement using "changes" clause, which doesn't seem to be the case.
INSERT INTO PUBLIC.CUSTOMER (AGE, ...)
SELECT AGE, ...
FROM STAGING.CUSTOMER
    CHANGES (information => default)
    AT (timestamp => 1675772176::timestamp)
WHERE "METADATA$ACTION" = 'INSERT';
The select statement using the "changes" clause is way slower than the merge statement I am using currently.
I checked the execution plan and found that Snowflake performs a self-join (of sorts) on the table at two different timestamps.
Should this really be the behaviour, or am I missing something here? I was hoping for better performance, on the assumption that it would scan the table once and then simply insert the new records, which should be faster than the merge statement.
Also, even if it does a self-join, why does the merge query perform better than this? The merge query is also doing a join on similar volumes.
I was also hoping to use the same mechanism for deletes/updates on the source table.

partitioning an existing Table and using Like or Where clause with it

I'm trying to partition an existing table, but I need to use a WHERE or LIKE clause.
The problem is that I can't partition it, because every query I've written produces an error.
My last code is:
select *
into agg partition by range (subscriberid)
(subscriberid values LIKE '%42' on seg1)
from ncs_sub_unsub;
I've consulted a lot of sources trying to find out how I can combine a WHERE clause with the partitioning.
Note: the WHERE and LIKE clauses are an integral part of the partitioning.
My table: ncs_sub_unsub
The unique identifier is: subscriberid

SQL Server index usage and implicit conversion

Can someone please explain the following behavior to me, let me know if table definitions etc. would help.
I have a query executed on SQL Server 2016 SP2, all tables have the default clustered index on the primary key column, which is an IDENTITY column:
SELECT a.smallint_col
FROM TableA a
INNER JOIN TableB b ON b.int_col = a.int_col
WHERE a.int_col = 123 AND b.varchar_col = 12345;
This query returns an error:
Conversion failed when converting the varchar value 'ABC123' to data type int.
'ABC123' is the value of a row in TableB.varchar_col.
I understand this query forces SQL Server to perform implicit conversion on TableB.varchar_col because it is passed without single quotes.
I can see from Include Live Query Statistics that this query is trying to use a non-clustered index scan on an index defined on TableB as:
CREATE NONCLUSTERED INDEX [varchar_col]
ON [TableB] ([varchar_col])
and an index seek on an index defined on TableA as:
CREATE NONCLUSTERED INDEX [int_col]
ON [TableA] ([int_col])
If I force the query to use the clustered indexes on each table by using WITH (INDEX(1)), the query returns successfully. I know that if I quote the value correctly '12345', the query also returns successfully (for some unknown reason our code passes it without the quotes) and I think that is the real solution.
However, I'd like to understand the behavior of SQL Server here. Why is the clustered index scan able to perform the implicit conversion without throwing the error but the non clustered index scan can't?
What's likely happening here comes down to the order in which SQL Server applies the predicates.
When you get the failure, the engine is likely applying the predicate b.varchar_col = 12345 first. As a result, the query fails as soon as a value that can't be (implicitly) converted to an int is compared.
For the times it works, the predicate a.int_col = 123 is likely evaluated first. Once it has been applied, any remaining rows contain values in b.varchar_col that can be implicitly converted, and thus there is no failure.
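As a stopgap until the application is fixed, one defensive rewrite (a sketch assuming SQL Server 2012 or later, where TRY_CONVERT is available) is to make the conversion safe, since TRY_CONVERT returns NULL instead of raising an error when the conversion fails:
SELECT a.smallint_col
FROM TableA a
INNER JOIN TableB b ON b.int_col = a.int_col
-- rows whose varchar_col can't convert to int become NULL and drop out of the filter
WHERE a.int_col = 123
  AND TRY_CONVERT(int, b.varchar_col) = 12345;
Note that this sidesteps the error but, like the original, may still prevent a straightforward seek on the varchar_col index.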
Like you said, however, the real solution here is to correct your application layer. You should likely be using a parametrised query rather than injecting raw values; then you control the datatypes:
SELECT a.smallint_col
FROM TableA a
INNER JOIN TableB b ON b.int_col = a.int_col
WHERE a.int_col = @IntParam and b.varchar_col = @VarcharParam;
In your application code, you can then define @IntParam as an int and @VarcharParam as a varchar, meaning that no implicit conversion can happen, apart perhaps from your application passing 12345 to the parameter @VarcharParam.
How you parametrise your application is a different question though (seeing as we don't even know what language your application uses, we can't even provide an example, I'm afraid).

Convert Date Stored as VARCHAR into INT to compare to Date Stored as INT

I'm using SQL Server 2014. My request I believe is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The data value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison, since one value will still be VARCHAR and the other INT. If it's possible to convert on the fly in a T-SQL query, that would be great advice as well. :)
REPLACE the string and then CONVERT to integer
SELECT A.*, B.*
FROM TableA A
INNER JOIN
(SELECT intField
FROM TableB
) as B
ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed.
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a Stored Procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. But making a datatype change (or even moving a field to a different table, or even replacing a field with a simple CASE statement) "behind the scenes", masking it so that any code outside of your control doesn't know a change happened, is not nearly as difficult as most people might think.
I have done all of these operations (datatype change, move a field to a different table, replace a field with simple logic, etc.), and it buys you a lot of time until the app code can be updated. That might be handled by another team whose schedule won't allow for making any changes in that area (plus testing) for 3 months. OK; it will be there waiting for them when they are ready. And if there are several areas to update, they can be done one at a time. You can even create new stored procedures, running in parallel, for any updated app code to use with the proper INT datatype as the input parameter. And once all references to the VARCHAR value are gone, delete the original versions of those stored procedures.
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
                  from t2
                  where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
                 );
This should be close enough to except for your purposes.
I should add that you might need to include other columns in the where statement. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a computed column in which the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMABINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
--, other columns including the PK, etc
FROM dbo.TableWithCharColumn;
CREATE UNIQUE CLUSTERED INDEX IX_PersistedView
ON dbo.PersistedView(<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better on the SELECT statement where you join the two columns compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and will prevent you from making DDL changes to the underlying tables.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
PCT_ID INT IDENTITY(1,1) NOT NULL
    CONSTRAINT PK_PCT
    PRIMARY KEY CLUSTERED
, SomeChar VARCHAR(50) NOT NULL
, SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);
CREATE INDEX IX_PCT_SomeCharToInt
ON dbo.PCT(SomeCharToInt);
INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');
SELECT SomeCharToInt
FROM dbo.PCT;
Results:
201508

Bad int8 external representation "6*725" in Netezza

I am getting an error like "Bad int8 external representation "6*725"" in Netezza while executing a stored procedure. This stored procedure takes data from a table, does some transformations, and loads the result into another table.
Can anyone please help me?
Thanks,
Brajendra
FYI: there may be multiple answers to this question, because we do not have the query that you ran to get the error.
If you did a direct INSERT command like the one below, the order of the columns of the table in the SELECT clause may not match the order of the columns of the table in the INSERT clause. Most database management systems don't care what the order is, but Netezza does. The fact that it threw "Bad int8" just means that the first column it couldn't match in the SELECT clause has that data type, while the corresponding column in the INSERT clause has a different data type.
INSERT INTO DB1..TABLE1
SELECT * FROM DB1..TABLE2;
You can fix this in one of two ways: either change the order of the columns by dropping and recreating the table, or use explicit column lists in the INSERT INTO/SELECT command.
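For example, a minimal sketch of the explicit-column-list form (the column names here are hypothetical):
-- listing the columns by name makes the mapping explicit,
-- so the physical column order of the two tables no longer matters
INSERT INTO DB1..TABLE1 (CUSTOMER_ID, CUSTOMER_NAME)
SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM DB1..TABLE2;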
