Snowflake: Pattern date search read from S3

Requirement: I need to fetch only the latest file every day; in this example it's the 20200902 file.
Example Files in S3:
@stagename/2020/09/reporting_2020_09_20200902000335.gz
@stagename/2020/09/reporting_2020_09_20200901000027.gz
Code:
select distinct metadata$filename
from @stagename/2020/09/
(file_format => 'APP_SKIP_HEADER', pattern => '.*reporting_.*20200902.*\\.gz');

This will work no matter what naming convention the files use. Since your files appear to have a naming convention based on date and are one per point in time, you may not need to use the date to do this, as you could use the name. You'll still want to use the result_scan approach.
I haven't found a way to get the date for a file in a stage other than using the LIST command. The docs say that FILE_NAME and FILE_ROW_NUMBER are the only available metadata in a select query. In any case, that approach reads the data, and we only want to read the metadata.
Since a LIST command is a metadata query, you'll need to query the result_scan to use a where clause.
One final issue that I ran into while working on a project: the last_modified date in the LIST command is in a format that requires a somewhat long conversion expression to convert to a timestamp. I made a UDF to do the conversion so that it's more readable. If you'd prefer putting the expression directly in the SQL, that's fine too.
First, create the UDF.
create or replace function LAST_MODIFIED_TO_TIMESTAMP(LAST_MODIFIED string)
returns timestamp_tz
as
$$
to_timestamp_tz(left(LAST_MODIFIED, len(LAST_MODIFIED) - 4) || ' ' || '00:00', 'DY, DD MON YYYY HH:MI:SS TZH:TZM')
$$;
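Outside Snowflake, the same conversion logic can be sketched in Python. This assumes last_modified values look like "Tue, 1 Sep 2020 00:03:35 GMT" (an assumption inferred from the UDF's format string, not from the docs):

```python
from datetime import datetime, timezone

def last_modified_to_timestamp(last_modified: str) -> datetime:
    """Convert a LIST-style last_modified string to an aware timestamp.

    Mirrors the UDF: drop the trailing ' GMT' (like left(..., len - 4))
    and treat the remaining value as UTC.
    """
    trimmed = last_modified[:-4]
    dt = datetime.strptime(trimmed, "%a, %d %b %Y %H:%M:%S")
    return dt.replace(tzinfo=timezone.utc)

print(last_modified_to_timestamp("Tue, 1 Sep 2020 00:03:35 GMT"))
# 2020-09-01 00:03:35+00:00
```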
Next, list the files in your stage or subdirectory of the stage.
list @stagename/2020/09/;
Before running any other query in the session, run this one against the last query ID. You can of course run it any time within 24 hours if you specify the query ID explicitly.
select "name",
"size",
"md5",
"last_modified",
last_modified_to_timestamp("last_modified") LAST_MOD
from table(result_scan(last_query_id()))
order by LAST_MOD desc
limit 1
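Putting the pieces together, the list / convert / sort descending / take-one approach amounts to the following, sketched in Python with invented file names and timestamps:

```python
from datetime import datetime, timezone

# Hypothetical LIST output: (name, last_modified) pairs.
files = [
    ("reporting_2020_09_20200901000027.gz", "Tue, 1 Sep 2020 00:00:27 GMT"),
    ("reporting_2020_09_20200902000335.gz", "Wed, 2 Sep 2020 00:03:35 GMT"),
]

def parse(last_modified: str) -> datetime:
    # Same conversion as the UDF: drop ' GMT', parse, treat as UTC.
    return datetime.strptime(last_modified[:-4], "%a, %d %b %Y %H:%M:%S").replace(tzinfo=timezone.utc)

# ORDER BY last_modified DESC LIMIT 1, expressed as max() over parsed timestamps.
latest = max(files, key=lambda f: parse(f[1]))
print(latest[0])  # the most recently modified file
```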

Related

How to query multiple JSON document schemas in Snowflake?

Could anyone tell me how to change the Stored Procedure in the article below to recursively expand all the attributes of a json file (multiple JSON document schemas)?
https://support.snowflake.net/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling-part-2
Craig Warman's stored procedure posted in that blog is a great idea. I asked him if it was okay to refactor his code, and he agreed. I've used the refactored version in the field, so I know the SP and how it works quite well.
It may be possible to modify the SP to work on your JSON. It will depend on whether or not Snowflake types the JSON in your variant column. The way you have it structured, it may not type everything. You can check by running this SQL and seeing if the result set includes all the columns you need:
set VARIANT_TABLE = 'WEATHER';
set VARIANT_COLUMN = 'V';
with MAIN_TABLE as
(
select * from identifier($VARIANT_TABLE) sample (1000 rows)
)
select distinct REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\[(.+)\\]'),'[^a-zA-Z0-9]','_') AS path_name, -- Strips bracket-enclosed array references (like [0]) and replaces any remaining non-alphanumeric characters with underscores
typeof(f.value) AS attribute_type, -- The datatype Snowflake inferred for this attribute
path_name AS alias_name -- Column alias based on the path (Snowflake allows reusing an alias defined earlier in the same SELECT)
from
MAIN_TABLE,
LATERAL FLATTEN(input => identifier($VARIANT_COLUMN), recursive => true) f
where TYPEOF(f.value) != 'OBJECT'
AND NOT contains(f.path, '[');
Be sure to replace the variables with your table and column names. If this picks up the type information for the columns in your JSON, then it's possible to modify the SP to do what you need. If it doesn't, but there's a way to modify the query to get it to pick up the columns, that would work too.
If it doesn't pick up the columns: based on Craig's idea, I wrote type inference for non-variant data (such as strings from CSV log files without type information). Try the SQL above first and see what it returns.
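To make the recursive FLATTEN's behavior concrete, here is a rough Python sketch that walks a nested document and emits (path, type) pairs for scalar leaves, skipping objects and array paths much like the WHERE clause above (the sample document is invented):

```python
def flatten_types(doc, prefix=""):
    """Yield (path_name, attribute_type) for each scalar leaf, skipping
    objects and arrays, roughly like the recursive FLATTEN query."""
    for key, value in doc.items():
        # Join path levels with '_', as the REGEXP_REPLACE does.
        path = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_types(value, path)
        elif isinstance(value, list):
            continue  # the query filters out array paths (contains '[')
        else:
            yield path, type(value).__name__

sample = {"station": {"id": 42, "name": "KSFO"}, "temp": 13.5, "readings": [1, 2]}
print(sorted(flatten_types(sample)))
# [('station_id', 'int'), ('station_name', 'str'), ('temp', 'float')]
```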

Snowflake - how to display column name using function?

I did not get much help from the Snowflake documentation on how to generate column names using Snowflake functions.
I have an automated report that does calculations for the given dates. Here it is:
select
sum(case when logdate = to_date(dateadd('day', - 10, '2019-11-14')) then eng_fees + data_fees end) AS to_date(dateadd('day', - 10, '2019-11-14'))
from myTable
where logdate = '2019-11-04'
I am getting the following output for my column name:
to_date(dateadd('day', - 10, '2019-11-14'))
100
My expected output for my column name
2019-11-04
100
How can I get the expected date as the column name in Snowflake?
Your statement is using AS to name your column. Snowflake treats that name as a literal, not a calculation. In order to do what you're requesting inside a function, you'll need to use a JavaScript function, I think. This will allow you to dynamically build the SQL statement with your calculated column name predefined.
https://docs.snowflake.net/manuals/sql-reference/udf-js.html
You can't have dynamic column names without using external functionality (or TASK scheduling).
You can create a JavaScript Stored Procedure that generates a VIEW where the column names can be set by dynamic parameters / expressions.
The normal way of handling this is to use a reporting tool that can display your fixed column result set with dynamic headers or run dynamic SQL altogether.
I see 2 paths of getting close to what you want:
1) You can simply use UNION to display your headers as the first row and change the actual column aliases to 1..N, so that they just number the columns. This is the fastest and the cheapest option.
2) You can use dynamic SQL to generate a query with your alias names filled in dynamically and then run it (you can, for example, use PIVOT to construct such a query).
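Option 2 can be sketched in Python: compute the date first, then splice it into both the predicate and the alias before the query is parsed (table and column names are taken from the question; the builder function itself is hypothetical):

```python
from datetime import date, timedelta

def build_report_sql(base: date) -> str:
    """Build the report query with the computed date baked in as the
    column alias, since SQL aliases must be literals at parse time."""
    target = base - timedelta(days=10)  # mirrors dateadd('day', -10, base)
    return (
        f"select sum(case when logdate = '{target}' "
        f"then eng_fees + data_fees end) "
        f'as "{target}" '
        f"from myTable where logdate = '{target}'"
    )

print(build_report_sql(date(2019, 11, 14)))
```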

Using a variable within the FROM section of a Log Parser Lizard IIS log query

I'm trying to speed up my Log Parser Lizard queries to IIS logs on one of our servers.
This kind of query works but it's very slow:
SELECT TOP 100 * FROM '\\myserver\c$\inetpub\logs\LogFiles\W3SVC1\u_ex*.log'
ORDER BY time DESC
If I specify today's log filename that's a lot quicker:
SELECT TOP 100 * FROM '\\myserver\c$\inetpub\logs\LogFiles\W3SVC1\u_ex190731.log'
ORDER BY time DESC
I'm trying to find a way to achieve this without having to keep changing the filename within the query to match today's date. I can't find any way of using variables or functions like strcat within the FROM section of the query.
So in simpler terms, is there any way to inject today's date into a query like this:
SELECT * FROM 'C:\test\%DATE%.txt'
I found that Log Parser Lizard supports inline VB.NET code, specified using <% ... %> tags, so it was just a matter of inserting the date using that syntax, and it worked fine.
SELECT TOP 100 *
FROM '\\myserver\c$\inetpub\logs\LogFiles\W3SVC1\u_ex<% return DateTime.Now.ToString("yyMMdd") %>.log'
ORDER BY time DESC
or in my simplified version it would be:
SELECT * FROM 'C:\test\<% return DateTime.Now.ToString("yyMMdd") %>.txt'
-- converts to 'C:\test\190731.txt'
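The same idea works when the query text is built outside Log Parser Lizard: compute today's date and splice it into the path before handing the query off. A Python sketch using the path from the question:

```python
from datetime import datetime

# Build the IIS log path for today's date, e.g. u_ex190731.log on 2019-07-31.
log_dir = r"\\myserver\c$\inetpub\logs\LogFiles\W3SVC1"
today_log = rf"{log_dir}\u_ex{datetime.now():%y%m%d}.log"
print(today_log)
```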

How can you create a table (or other object) that always returns the value passed to its WHERE-clause, like a mirror

There is a legacy application that uses a table to translate job names to filenames. This legacy application queries it as follows:
SELECT filename FROM aJobTable WHERE jobname = 'myJobName'
But in reality those jobnames always match the filenames (e.g. 'myJobName.job' is the jobname but also the filename) That makes this table appear unnecessary. But unfortunately, we cannot change the code of this program, and the program just needs to select it from a table.
That's actually a bit annoying, because we do need to keep this database in sync: if a jobname is not in the table, it cannot be used. So, as our only way out, we currently have some VBScripts that synchronize this table, adding a record for each possible filename. As a result, the table has just 2 columns with identical values. We want to get rid of this.
So, we have been dreaming about some hack that queries the data with the jobname, but just always returns the jobname again, like a copy/mirror query. Then we don't actually have to populate a table at all.
"Exploits"
The following can be configured in this legacy application. My hunch is that these may open the door for some tricks/hacks.
use of either MS Access or SQL Server (we prefer sql server)
The name of the table (e.g. aJobTable)
The name of the filename column (e.g. filename)
The name of the jobname column (e.g. jobname)
Here is what I came up with:
If I create a table-valued function mirror(a) then I get pretty close to what I want. Then I could use it like
SELECT filename FROM mirror('MyJobName.job')
But that's just not good enough, it would be if I could force it to be like
SELECT filename FROM mirror WHERE param1 = 'MyJobName.job'
Unfortunately, I don't think it's possible to call functions like that.
So, I was wondering if perhaps somebody else knows how to get it working.
So my question is: "How can you create a table (or other object) that always returns the value passed to its WHERE-clause, like a mirror."
It's kinda hard to answer without knowing the code that the application uses, but if we assume it just concatenates strings without any checks whatsoever, I would expect code like this (translated to C#):
var sql = "SELECT "+ field +" FROM "+ table +" WHERE "+ conditionColumn +" = '"+ searchValue +"'";
As this is an open door for SQL injection, and given that SQL Server allows two ways of creating an alias (value AS alias and alias = value),
you can take advantage of that and try to generate an SQL statement like this:
SELECT field /* FROM table WHERE conditionColumn */ = 'searchValue'
So field should be "field /*", and conditionColumn should be "conditionColumn */". The table name doesn't matter; you can leave it as an empty string.
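Assuming the legacy application concatenates the configured names like the C# line above (an assumption), the effect of those settings can be checked with a quick sketch; build_sql here is a hypothetical stand-in for the app:

```python
def build_sql(field, table, condition_column, search_value):
    # Stand-in for the legacy app's naive string concatenation.
    return ("SELECT " + field + " FROM " + table +
            " WHERE " + condition_column + " = '" + search_value + "'")

# Configure the names so FROM ... WHERE ends up inside a comment:
sql = build_sql("filename /*", "", "jobname */", "MyJobName.job")
print(sql)
# SQL Server effectively sees: SELECT filename = 'MyJobName.job'
# (alias = value), which returns the search value itself
# in a column named "filename".
```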

How to take apart information between hyphens in SQL Server

How would I take apart a column that contains strings like:
92873-987dsfkj80-2002-04-11
20392-208kj48384-2008-01-04
Data would look like this:
Filename  Yes/No  Key
Abidabo   Yes     92873-987dsfkj80-2002-04-11
Bibiboo   No      20392-208kj48384-2008-01-04
Want it to look like this:
Filename  Yes/No  Key
Abidabo   Yes     92873-987dsfkj80-20020411
Bibiboo   No      20392-208kj48384-20080104
I would like the dates at the end concatenated as 20020411 and 20080104. Counting from the right, the format is always the same; from the left it is not, otherwise I could have concatenated it. It is not an import issue.
As mentioned in the comments already, storing data like this is a bad idea. However, you can obtain the dates from those strings by using a RIGHT function like so:
SELECT RIGHT('20392-208kj48384-2008-01-04', 10)
Output:
2008-01-04
Depending on the SQL Server version you are using, you can use STRING_SPLIT, which requires COMPATIBILITY_LEVEL 130. You can also build your own user-defined function to split the contents of a field and manipulate it as you need; you can find some useful examples of SPLIT functions in this thread:
Split function equivalent in T-SQL?
Assuming I'm correct and the date part is always on the right side of the string, you can simply use RIGHT and CAST to get the date (assuming, again, that the date is represented as yyyy-mm-dd):
SELECT CAST(RIGHT(YourColumn, 10) As Date)
FROM YourTable
However, Panagiotis is correct in his comment: you shouldn't store data like that. Each column in the database should hold only a single piece of data, be it a string, a number, or a date.
Update following your comment and the updated question:
SELECT LEFT(YourColumn, LEN(YourColumn) - 10) + REPLACE(RIGHT(YourColumn, 10), '-', '')
FROM YourTable
will return the desired results.
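The LEFT/RIGHT/REPLACE expression can be sanity-checked outside SQL Server; a Python equivalent of the same string surgery:

```python
def compact_date_suffix(key: str) -> str:
    """Keep everything but the last 10 characters, then append the
    date part with its hyphens removed: the same as
    LEFT(col, LEN(col) - 10) + REPLACE(RIGHT(col, 10), '-', '')."""
    return key[:-10] + key[-10:].replace("-", "")

print(compact_date_suffix("92873-987dsfkj80-2002-04-11"))  # 92873-987dsfkj80-20020411
```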
