Snowflake external table requires positional columns issue

I am facing an issue where the columns of the extracted files can change from one day to the next:
for example:
On day 1, the file might have 3 columns: c1,c2,c3
On day 2, the file might have 5 columns: c1,c3,c2,c4,c5
Notice the column position of c3 in the second file.
Using the "COPY INTO" from external stage syntax in Snowflake won't work, because c3 appears in a different position in the second file.
I tried an external table, but it also requires positional column references to work.
Has anyone worked out how to load these types of files?

You are not telling us anything about the file format being used.
The only way to load varying columns is to load each file record as a single column, e.g. with FIELD_DELIMITER = NONE, and then split and convert it into an OBJECT with each file column as an attribute.
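A minimal sketch of that single-column load; the stage, table, and file format names here are assumptions:

-- Each line of the file arrives as one column because there is no delimiter
create or replace file format one_col_csv
    type = csv
    field_delimiter = none;

create or replace table raw_lines (rec varchar);

copy into raw_lines
from @my_stage/day2.csv
file_format = (format_name = one_col_csv);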
If the first record contains the field names c1 ... cn, you can load with:
WITH
-- Simulates a file loaded as single-column records; the first record holds the header
file AS (
    SELECT * FROM VALUES ('c1,c2,c3'), ('1,2,3'), ('11,22,33') t(REC)
),
-- Split every record into (SEQ, INDEX, VALUE) rows
split_file AS (
    SELECT * FROM file, LATERAL SPLIT_TO_TABLE(REC, ',')
),
-- Pair each value with its header name and aggregate into one OBJECT per record
combined_table AS (
    SELECT content.SEQ - 1 AS REC_NO,
           OBJECT_AGG(headers.VALUE, content.VALUE::VARIANT) AS OBJ
    FROM split_file content
    INNER JOIN split_file headers
        ON content.INDEX = headers.INDEX
       AND content.SEQ > 1
       AND headers.SEQ = 1
    GROUP BY content.SEQ
)
SELECT OBJ:c1::NUMBER c1, OBJ:c2::NUMBER c2, OBJ:c3::NUMBER c3, OBJ:c4::NUMBER c4
FROM combined_table;
The example above combines everything into a single query, but in your case you would aggregate each file separately and INSERT (append) into combined_table.
The reason this works is that you can reference object attributes (columns) that are not there (e.g. c4), and they will be substituted with NULL.
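A sketch of that per-file append, assuming raw_lines holds one file's records (loaded as above) and a permanent combined_table(REC_NO, OBJ) already exists:

INSERT INTO combined_table (REC_NO, OBJ)
WITH split_file AS (
    SELECT s.* FROM raw_lines, LATERAL SPLIT_TO_TABLE(REC, ',') s
)
SELECT content.SEQ - 1,
       OBJECT_AGG(headers.VALUE, content.VALUE::VARIANT)
FROM split_file content
INNER JOIN split_file headers
    ON content.INDEX = headers.INDEX
   AND content.SEQ > 1
   AND headers.SEQ = 1
GROUP BY content.SEQ;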

Related

I have a table with 1000+ columns. How do I load data into a Snowflake table from Parquet files without explicitly specifying the column names?

Per the Snowflake documentation, if you are loading from a Parquet file with 3 columns, you can use:
"copy into cities
from (select
$1:continent::varchar,
$1:country:name::varchar,
$1:country:city::variant
from #sf_tut_stage/cities. parquet);
"
If I have 1000+ columns, is there a way to avoid listing every column like $1:col1, $1:col2 ... $1:col1000?
You may want to check out our INFER_SCHEMA function to dynamically obtain the columns/datatypes: https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html
The EXPRESSION column in its output should get you 95% of the way there.
select *
from table(
    infer_schema(
        location => '@mystage',
        file_format => 'my_parquet_format'
    )
);
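Building on that, one pattern (a sketch; the table and stage names are assumptions) is to create the table directly from the inferred schema and then load the Parquet files by column name rather than by position:

-- Create the table from the schema INFER_SCHEMA detected
create or replace table cities using template (
    select array_agg(object_construct(*))
    from table(
        infer_schema(
            location => '@mystage',
            file_format => 'my_parquet_format'
        )
    )
);

-- Load by matching file column names to table column names
copy into cities
from @mystage
file_format = (format_name = 'my_parquet_format')
match_by_column_name = case_insensitive;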

Is there a way to get table/column information from Snowflake?

I'm wondering if there's a way I can access Snowflake to see the column names within a table without actually using the CLI.
e.g. Using a REST API endpoint to return the columns for any/all tables in any format.
Thanks ahead of time.
edit:
My goal is to output something like this to a file:
table1 :
column1,column2,column3...
table2:
column1,column2,column3...
Not sure if this works for you or not:
create or replace table demo
(
    c1 int,
    c2 text,
    c3 date,
    c4 variant
);

desc table demo;

-- RESULT_SCAN reads the output of the previous DESC TABLE statement
select
    'demo' as table_name,
    ARRAY_AGG("name") as column_names
from
    table(result_scan(last_query_id()));
You can UNION such queries to get the names of all tables and their respective columns.
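Alternatively (a sketch, with the database and schema names as assumptions), INFORMATION_SCHEMA.COLUMNS can produce the table-to-columns listing in one query:

select table_name,
       -- one comma-separated line of column names per table, in column order
       listagg(column_name, ',') within group (order by ordinal_position) as column_names
from my_db.information_schema.columns
where table_schema = 'PUBLIC'
group by table_name;

Since the question asks about REST access, note that a query like this can also be submitted through the Snowflake SQL API rather than the CLI.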

Snowflake - Keeping target table schema in sync with source table variant column value

I ingest AVRO data into a table source_table. There is a column in this table, say "avro_data", which is populated with variant data.
I plan to copy the data into a structured table target_table whose columns have the same names and datatypes as the avro_data fields in the source table.
Example:
select avro_data from source_table
{"C1":"V1", "C2", "V2"}
This will result in
select * from target_table
------------
| C1 | C2 |
------------
| V1 | V2 |
------------
My question is: when the schema of avro_data evolves and new fields get added, how can I keep the schema of target_table in sync by adding equivalent columns to the target table?
Is there anything out of the box in Snowflake to achieve this, or has someone created code to do something similar?
Here's something to get you started. It shows how to take a variant column and parse out the internal columns. This uses a table in the Snowflake sample data database, which is not identical in every account, so you may need to adjust the table name and column name.
SELECT DISTINCT
    -- Generates paths with levels enclosed by double quotes (ex: "path"."to"."element");
    -- also strips any bracket-enclosed array element references (like "[0]")
    regexp_replace(regexp_replace(f.path, '\\[(.+)\\]'), '(\\w+)', '"\\1"') AS path_name,
    -- Generates column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
    DECODE(substr(typeof(f.value), 1, 1),
           'A', 'ARRAY',
           'B', 'BOOLEAN',
           'I', 'FLOAT',
           'D', 'FLOAT',
           'STRING') AS attribute_type,
    -- Generates column aliases based on the path
    regexp_replace(regexp_replace(f.path, '\\[(.+)\\]'), '[^a-zA-Z0-9]', '_') AS alias_name
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."JCUSTOMER",
    LATERAL FLATTEN(INPUT => "CUSTOMER", RECURSIVE => TRUE) f
WHERE TYPEOF(f.value) != 'OBJECT'
    AND NOT contains(f.path, '[');
This is a snippet of code modified from here: https://community.snowflake.com/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling. The blog author credits a colleague for this section of code.
While the current incarnation of the stored procedure creates a view from the internal columns in a variant, an alternate version could create and/or alter a table to keep it in sync with changes.
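As a sketch of that alternate version, assuming the output of the query above has been materialized into a hypothetical discovered_paths table:

-- Generate ALTER TABLE statements for paths missing from the target table
select 'ALTER TABLE target_table ADD COLUMN ' || p.alias_name || ' ' || p.attribute_type || ';' as ddl
from discovered_paths p
where not exists (
    select 1
    from information_schema.columns c
    where c.table_name = 'TARGET_TABLE'
      and c.column_name = upper(p.alias_name)
);

Each generated statement can then be executed (e.g. via EXECUTE IMMEDIATE inside a stored procedure) to add the missing columns.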

Determine if columns have duplicate values in SQL

I am trying to figure out a way to check if there are repeated values across the columns of a row.
Example:
HMOID  Name  Addon10    Addon15    Addon20
RFFF   Blah  img path1  img path2  img path1
For my example, I would like to check if any of the addons for RFFF have a repeated value. In my example above, 'RFFF' has two images that are the same in Addon10 and Addon20. (The images have a path, so currently they look like
http://oc2-reatest.regalmed.local/ocupgrade52/images/NDL_SCAN_SR.PNG.)
I would like to be able to do this for multiple rows. I thought the following would give me an idea of how to begin:
select * from HlthPlan
Group By HMO1A, HMONM
Having COUNT(*) > 1
However, it throws the following error:
Msg 8120, Level 16, State 1, Line 1
Column 'HlthPlan.HMOID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I am fairly new to SQL and any suggestions would be appreciated.
Don't include * in your select query. Only include the columns that you use in the GROUP BY:
SELECT HMO1A, HMONM, COUNT(*) from HlthPlan
GROUP BY HMO1A, HMONM
HAVING COUNT(*) > 1;
With only three columns to check, assuming non-null values across a single row:
select * from HlthPlan
where Addon10 in (Addon15, Addon20) or Addon15 = Addon20
You can also use cross apply to pivot the values for grouping:
select HMOID, addon
from HlthPlan cross apply (
select addon
from (values (Addon01), (Addon02), (Addon03), ... (Addon20)) as pvt(addon)
) a
where addon is not null
group by HMOID, addon
having count(*) > 1;
http://rextester.com/QWIW87618
You'll get multiple rows for each HMOID where there are different groups of columns having the same value. By the way, reporting the names of the specific columns involved would add another degree of difficulty to the query.
One way you can check for this is using UNPIVOT to compare your results:
create table #hmo (hmoid varchar(6), name varchar(25), Addon10 varchar(25),
    Addon15 varchar(25), addon20 varchar(25));

insert into #hmo
values ('RFFF', 'Blah', 'img path1', 'img path2', 'img path1');

select hmoid, name, addval, addcount = count(adds)
from #hmo
unpivot
(
    addval for adds in (addon10, addon15, addon20)
) as unpvt
group by hmoid, name, addval
having count(*) > 1;
This will give the results:
hmoid  name  addval     addcount
RFFF   Blah  img path1  2
This way you can check against every row in the table and will see any row that has any two or more columns with the same value.
This does have the potential to get tedious if you have a lot of columns; the easiest way to correct for that is to build your query dynamically using info from sys.tables and sys.columns.
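A sketch of that dynamic approach, assuming SQL Server 2017+ for STRING_AGG and assuming the addon columns share an Addon% naming pattern:

declare @cols nvarchar(max), @sql nvarchar(max);

-- Collect the addon column names for the target table from the catalog
select @cols = string_agg(quotename(c.name), ', ')
from sys.columns c
where c.object_id = object_id('dbo.HlthPlan')
  and c.name like 'Addon%';

-- Build and run the same UNPIVOT query over whatever addon columns exist
set @sql = N'select HMOID, addval, count(*) as addcount
from dbo.HlthPlan
unpivot (addval for adds in (' + @cols + N')) u
group by HMOID, addval
having count(*) > 1;';

exec sp_executesql @sql;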

How to add multiple files to SQL Server using T-SQL while adding one more column

I have two files named A11 and B22. I want to create a third file which merges the records from both of these files, adding one extra column that shows which original file each record belongs to.
Each file contains three records, as follows.
A11:
Mike,50
Rocky,60
Andy,70
B22:
Kristen,80
Natasha,90
Mila,100
I want the output to be something like this.
Output File
C33:
Mike,50,A11
Rocky,60,A11
Andy,70,A11
Kristen,80,B22
Natasha,90,B22
Mila,100,B22
Can anyone help me with how to get this desired result?
This is pretty sparse on details but I think you can do something like this.
insert C33
(
    Name
    , SomeValue
    , SourceFile
)
select Name
    , SomeValue
    , 'A11'
from A11
UNION ALL
select Name
    , SomeValue
    , 'B22'
from B22;
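If A11 and B22 are flat files rather than existing tables, a minimal sketch to stage them first (the file paths and column types are assumptions):

create table A11 (Name varchar(50), SomeValue int);
create table B22 (Name varchar(50), SomeValue int);

-- Load each comma-delimited file into its staging table
bulk insert A11 from 'C:\data\A11.txt'
with (fieldterminator = ',', rowterminator = '\n');

bulk insert B22 from 'C:\data\B22.txt'
with (fieldterminator = ',', rowterminator = '\n');

The UNION ALL query above can then populate C33, and a tool such as bcp can export C33 back out to a file.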
