Stage command metadata$filename shows hundreds of occurrences of a single file

I have files in my stage that I want to query, and since I want to include the filenames in the result, I use the metadata$filename pseudocolumn.
My stage is an Azure ADLS Gen2 stage.
Only one file in my stage matches the following regexp: .*regular.*[.]png.
When I run the command
SELECT
metadata$filename
FROM
@dev_silver_db.common.stage_bronze/DEV/BRONZE/<CENSORED>/S21/2715147 (
PATTERN => $pattern_png
)
AS t
I get 562 occurrences of the same file in my result.
I thought it was a bug in my IDE at first and double-checked in Snowflake's query history, and this is the actual result of the request.
If I run LIST, the proper dataset (1 result only) is returned.
If I run the following command (the same happens with any union):
SELECT $pattern_png
UNION
SELECT
metadata$filename
FROM
@dev_silver_db.common.stage_bronze/DEV/BRONZE/<CENSORED>/S21/2715147 (
PATTERN => $pattern_png
)
AS t
I get the following result.
In my opinion, this behavior should be considered a bug, but I may have missed something.
For now I will just use TOP(1) because that is fine in my case, but it may become a problem in other contexts.
Thank you in advance for your insights.

When you SELECT from a stage you are actually reading the content of the files using a FILE FORMAT. When none is specified, the CSV file format is used by default.
I think that what you're actually seeing is the metadata$filename value duplicated on every "row" that Snowflake can read in your file.
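If the goal is simply one row per staged file, a minimal sketch (reusing the stage path and $pattern_png variable from the question) is to deduplicate the metadata rather than rely on TOP(1):
SELECT DISTINCT
metadata$filename
FROM
@dev_silver_db.common.stage_bronze/DEV/BRONZE/<CENSORED>/S21/2715147 (
PATTERN => $pattern_png
)
AS t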

Related

using pattern while loading csv file from s3 to snowflake

The below copy command is not working; please correct me if something is wrong.
copy into mytable from @mystage pattern='20.*csv.gz'
Here I am trying to load the files that start with 20. There is a mix of files with names like 2021myfile.csv.gz and myfile202109.csv.gz, and the above command is not loading any files even though there are files that start with 20.
If I use pattern='.*20.*csv.gz' it picks up all the files, which is wrong; I need to load only the files that start with 20.
Thanks!
This is because the pattern clause is a Regex expression.
Try this:
copy into mytable from @mystage pattern = '[20]*.*csv.gz'
Reference: Loading Using Pattern Matching
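A hedged way to preview what a pattern will match before running the COPY (reusing the stage and table names from the question; the anchored pattern below is just one illustrative way to express "starts with 20", since PATTERN is a regular expression applied to the staged file path, and the optional (.*/)? prefix covers files sitting in sub-folders):
select distinct metadata$filename
from @mystage (pattern => '(.*/)?20[^/]*[.]csv[.]gz');
copy into mytable from @mystage pattern = '(.*/)?20[^/]*[.]csv[.]gz'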

COPY INTO with partitioned ADLS

I have a container with partitioned parquet files that I want to use with the copy into command. My directories look like the below.
ABC_PARTITIONED_ID=1 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
4-snappy.parquet
ABC_PARTITIONED_ID=2 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
ABC_PARTITIONED_ID=3 (directory)
1-snappy.parquet
2-snappy.parquet
....
Each partitioned directory can contain multiple parquet files. I do not have a hive partition column that matches the pattern of the directories (ID1, ID2 etc).
How do I properly use the pattern parameter in the copy into command to write to a SF table from my ADLS? I am using this https://www.snowflake.com/blog/how-to-load-terabytes-into-snowflake-speeds-feeds-and-techniques/ as an example.
I do not think you need to do anything with the pattern parameter.
You said you do not have a hive partition column that matches the pattern of the directories. If you do not have a column that uses these partitions, then they are probably not beneficial for querying the data. Maybe they were generated to help with maintenance. If this is the case, ignore the partitions and read all files with the COPY command.
If you think having such a column would help, then the blog post you mentioned already shows how you can parse the filenames to generate the column value. Add the partition column to your table (you may even define it as the clustering key), and run the COPY command to read all files in all partitions/directories, parsing the value of the column from the file name.
For parsing the partition value, I would use this expression, which seems easier:
copy into TARGET_TABLE from (
select
REGEXP_SUBSTR (
METADATA$FILENAME,
'.*\/ABC_PARTITIONED_ID=(.*)\/.*',
1,1,'e',1
) partitioned_column_value,
$1:column_name,
...
from @your_stage/data_folder/);
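As a quick sanity check, the same expression can be tested against a literal sample path (the value below is illustrative, based on the directory layout in the question):
select REGEXP_SUBSTR(
'data_folder/ABC_PARTITIONED_ID=2/1-snappy.parquet',
'.*\/ABC_PARTITIONED_ID=(.*)\/.*',
1,1,'e',1
) as partitioned_column_value;   -- returns '2'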
If the directory/partition name doesn't matter to you, then you can use some of the newer functions in Public Preview that support the Parquet format to create the table and ingest the data. As for how to construct the pattern, PATTERN => '.*[.]parquet' would read all subfolders.
//create the file format; only required once
create file format my_parquet_format
type = parquet;
//EXAMPLE CREATE AND COPY INTO FOR TABLE ABC
//create an empty table named ABC using this file format and location, inferring the schema from the parquet files
create or replace table ABC
using template (
select array_agg(object_construct(*))
from table(
infer_schema(
location=>'@mystage/ABC_PARTITIONED_ROOT',
file_format=>'my_parquet_format'
)
));
//copy the parquet files under ABC_PARTITIONED_ROOT (all subfolders) into table ABC
copy into ABC from @mystage/ABC_PARTITIONED_ROOT pattern = '.*[.]parquet' file_format = (format_name = 'my_parquet_format') match_by_column_name = case_insensitive;
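Before creating the table, the inferred schema can also be inspected on its own (a sketch reusing the same hypothetical stage location and file format as above):
//preview the column names and types that INFER_SCHEMA detects in the parquet files
select *
from table(
infer_schema(
location=>'@mystage/ABC_PARTITIONED_ROOT',
file_format=>'my_parquet_format'
)
);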
This should be possible by creating a storage integration, granting Snowflake access to the storage location in Azure, and then creating an external stage.
Alternatively, you can generate a shared access signature (SAS) token to grant Snowflake (limited) access to objects in your storage account. You can then create an external (Azure) stage that references the container using the SAS token.
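A minimal sketch of the storage-integration route, with placeholder tenant, account, and container names (adjust these to your environment; my_parquet_format is the file format created above):
//placeholder Azure identifiers; consent must still be granted in Azure AD
create storage integration my_azure_int
type = external_stage
storage_provider = 'AZURE'
enabled = true
azure_tenant_id = '<tenant-id>'
storage_allowed_locations = ('azure://myaccount.blob.core.windows.net/mycontainer/');
//run DESC STORAGE INTEGRATION my_azure_int for the consent URL, grant access in Azure, then:
create stage my_adls_stage
storage_integration = my_azure_int
url = 'azure://myaccount.blob.core.windows.net/mycontainer/data/'
file_format = my_parquet_format;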
Snowflake provides the following metadata columns for staged files:
METADATA$FILENAME - Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER - Row number for each record
We could do something like this:
select $1:normal_column_1, ..., METADATA$FILENAME
FROM
'@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
For example: it would give something like:
METADATA$FILENAME
----------
path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet
We need to derive the partition column from it. We could do a regexp_replace and get the partition value as a column like this:
select
regexp_replace(METADATA$FILENAME, '.*\/year=(.*)\/.*', '\\1'
) as year,
$1:normal_column_1
FROM
'@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
In the above regexp, we give the partition key.
The third parameter, '\\1', is a backreference to the regex match group; in our case the first group, which holds the partition value.
A more detailed answer and other approaches to this issue are available in this Stack Overflow answer.
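As a quick check, the same replacement can be run against a literal sample path (the illustrative value below matches the output shown above):
select regexp_replace(
'path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet',
'.*\/year=(.*)\/.*', '\\1'
) as year;   -- returns '2021'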

query snowflake s3 external file

I've created an S3 [external] stage and uploaded csv files into the \stage*.csv folder.
I can see the stage content by running list @my_stage.
If I query the stage with
select $1,$2,$3,$4,$5,$6 from @my_s3_stage
it looks like I'm randomly picking up files.
So I'm trying to select from a specific file by adding a pattern:
PATTERN => job.csv
This returns no results.
Note: I've used Snowflake for all of 5 hours, so I'm pretty new to the syntax.
For a pattern you can use
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]csv.gz') t;
The pattern is a regex expression.
For a specific file, you add the file name to the query like this:
select t.$1, t.$2 from @mystage/data1.csv.gz t;
If the file format is set in your stage definition, you don't need the file_format parameter.
More info can be found here: https://docs.snowflake.com/en/user-guide/querying-stage.html
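If you'd rather bake the file format into the stage so it never needs to be passed per query, a sketch with illustrative names (my_csv_format is hypothetical; the stage name is the one from the question):
create or replace file format my_csv_format
type = csv
skip_header = 1;
alter stage my_s3_stage set file_format = (format_name = 'my_csv_format');
--the format no longer needs to be passed, and the pattern is a regex, e.g. to match job.csv:
select t.$1, t.$2 from @my_s3_stage (pattern => '.*job[.]csv') t;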

Snowflake: Pattern date search read from S3

Requirement: I need to fetch only the latest file every day; in this example that is the 20200902 file.
Example Files in S3:
@stagename/2020/09/reporting_2020_09_20200902000335.gz
@stagename/2020/09/reporting_2020_09_20200901000027.gz
Code:
select distinct metadata$filename from
@stagename/2020/09/
(file_format=>'APP_SKIP_HEADER', pattern=>'.*/reporting_*20200902*.gz');
This will work no matter what the naming conventions of the files are. Since your files appear to have a naming convention based on date and are one per point in time, you may not need to use the date to do this, as you could use the name instead. You'll still want to use the result_scan approach.
I haven't found a way to get the date for a file in a stage other than using the LIST command. The docs say that FILE_NAME and FILE_ROW_NUMBER are the only available metadata in a select query. In any case, that approach reads the data, and we only want to read the metadata.
Since a LIST command is a metadata query, you'll need to query the result_scan to use a where clause.
One final issue that I ran into while working on a project: the last_modified date in the LIST command output is in a format that requires a somewhat long conversion expression to turn into a timestamp. I made a UDF to do the conversion so that it's more readable. If you'd prefer putting the expression directly in the SQL, that's fine too.
First, create the UDF.
create or replace function LAST_MODIFIED_TO_TIMESTAMP(LAST_MODIFIED string)
returns timestamp_tz
as
$$
to_timestamp_tz(left(LAST_MODIFIED, len(LAST_MODIFIED) - 4) || ' ' || '00:00', 'DY, DD MON YYYY HH:MI:SS TZH:TZM')
$$;
Next, list the files in your stage or subdirectory of the stage.
list @stagename/2020/09/
Before running any other query in the session, run this one on the last query ID. You can of course run it any time in 24 hours if you specify the query ID explicitly.
select "name",
"size",
"md5",
"last_modified",
last_modified_to_timestamp("last_modified") LAST_MOD
from table(result_scan(last_query_id()))
order by LAST_MOD desc
limit 1
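Once the latest file is identified, a possible follow-up (reusing the example path and named file format from the question) is to read just that one file by its full path:
select $1, $2
from @stagename/2020/09/reporting_2020_09_20200902000335.gz
(file_format => 'APP_SKIP_HEADER');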

How to reference sub-objects in talend schema

So I have the following SOQL query that includes the ActivityHistories relationship of the Account object:
SELECT Id, Name, ParentId, (SELECT Description FROM ActivityHistories)
FROM Account
WHERE Name = '<some client>'
This query works just fine in SOQLXplorer and returns 5 nested rows under the ActivityHistories key. In Talend, I am following the instructions from this page to access the sub-objects (although the example uses the query "up" syntax, not the query "down" syntax). My schema mapping is as follows:
The query returns the parent Account rows but not the ActivityHistory rows that are in the subquery:
Starting job GetActivities at 15:43 22/06/2016.
[statistics] connecting to socket on port XXXX
[statistics] connected
0X16000X00fQd61AAC|REI||
[statistics] disconnected
Job GetActivities ended at 15:43 22/06/2016. [exit code=0]
Is it possible to reference the subrows using Talend? If so, what is the syntax for the schema to do so? If not, how can I unpack this data in some way to get to the Description fields for each Account? Any help is much appreciated.
Update: I have written a small python script to extract the ActivityHistory records and dump them in a file, then used a tFileInput to ingest the CSV and then continue through my process. But this seems very kludgey. Any better options out there?
I've done some debugging from the code perspective, and it seems that if you specify the correct column name, you will get the correct response. For your example, it should be: Account_ActivityHistories_records_Description
The output from tLogRow will be similar to:
00124000009gSHvAAM|Account1|tests;Lalalala
As you can see, the Description values from all child elements are stored as one string, delimited by a semicolon. You can change the delimiter in the Advanced settings view of the SalesforceInput component.
I have written a small python script (source gist here) to extract the ActivityHistory records and dump them in a file (command line argument), then used a tFileInput to ingest the CSV and then continue through my process.
