Snowflake join table with stage file - snowflake-cloud-data-platform

I have CSV files with a varying number of columns - sometimes 2, sometimes 43. I have mapped these columns to a Snowflake metadata table. I want to insert values into the target table, but the column name in the CSV can differ from file to file: for example subject, subject_name or subject_names. In the target table I have only one column for this, called subject_name. So if the "subject" column in the CSV file is null I need to check the "subject_name" column, and if that is null I need to check "subject_names". Is there any way to check whether these columns have null values? I must add that the columns in the CSV are not always in the same position, so I can't use select $1 from @stage.
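One way to express that fallback, assuming the column positions for a given file have already been resolved from the metadata mapping table (the positions $3, $7 and $9 below, and the stage and file format names, are purely hypothetical), is to query the staged file directly and COALESCE the candidate columns, treating empty strings as NULL:

-- Hypothetical positions resolved per file from the mapping:
--   $3 = subject, $7 = subject_name, $9 = subject_names
insert into target_table (subject_name)
select coalesce(nullif(t.$3, ''), nullif(t.$7, ''), nullif(t.$9, ''))
from @my_stage (file_format => 'my_csv_format') t;

Because the positions differ per file, the statement would have to be generated per file, for example by a stored procedure that reads the header row and looks up the mapping.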

Related

Copy data from one table to another with an array of structs in BigQuery

We are trying to copy data from one table to another using an INSERT INTO ... SELECT statement.
Our original table schema is as follows, with several columns including a repeated record containing 5 structs of various data types:
[image: original table schema]
We want an exact copy of this table, plus 3 new regular columns, so we made an empty table with the new schema. However, when using the following code, the input table ends up with fewer rows overall than the original table.
insert into input_table
select column1, column2, null as newcolumn1, null as newcolumn2, null as newcolumn3,
  array_agg(struct(arr.struct1, arr.struct2, arr.struct3, arr.struct4, arr.struct5)) as arrayname, column3
from original_table, unnest(arrayname) as arr
group by column1, column2, column3;
We tried the solution from this page: How to copy data from one table into another table which has a record repeated column in GCP Bigquery
but the query would error, as it would treat the 5 structs within the array as arrays themselves (data type = e.g. string, mode = repeated, rather than nullable/required).
The error we see says that our repeated record column "has type ARRAY<STRUCT<struct1name ARRAY, struct2name ARRAY, struct3name ARRAY, ...>> which cannot be inserted into column summary, which has type ARRAY<STRUCT<struct1name STRING, struct2name STRING, struct3name STRING, ...>> at [4:1]"
Additionally, a query to find rows that exist in the original but not in the input table returns no results.
We also need the columns in this order (cannot do a simple copy of the table and add the 3 new columns at the end).
Why are we losing rows when using the above code to do an insert into... select?
Is there a way to copy over the data in this way and retain the exact number of rows?
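A likely cause of the missing rows, assuming some source rows have an empty or NULL repeated field: FROM original_table, UNNEST(arrayname) is a correlated CROSS JOIN, so any row whose array has no elements produces no output at all. A sketch that keeps those rows by using LEFT JOIN UNNEST instead (column names as in the question; it assumes the array elements already have the target field names, and IGNORE NULLS stops ARRAY_AGG from erroring on the NULL rows produced by the outer join):

insert into input_table
select column1, column2, null as newcolumn1, null as newcolumn2, null as newcolumn3,
  array_agg(arr ignore nulls) as arrayname, column3
from original_table
left join unnest(arrayname) as arr
group by column1, column2, column3;

Counting rows in the original table where arrayname is NULL or ARRAY_LENGTH(arrayname) = 0 should confirm whether empty arrays account for the difference.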

Have a table with 1000+ columns; how to load data into a Snowflake table from Parquet files without explicitly specifying the column names

In the Snowflake documentation, the example table has 3 columns, and when loading from a Parquet file you can use:
copy into cities
from (select
  $1:continent::varchar,
  $1:country:name::varchar,
  $1:country:city::variant
from @sf_tut_stage/cities.parquet);
If I have 1000+ columns, can I avoid listing all the columns like $1:col1, $1:col2 ... $1:col1000?
You may want to check out our INFER_SCHEMA function to dynamically obtain the columns/datatypes: https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html
The EXPRESSION column should be able to get you 95% of the way there.
select *
from table(
  infer_schema(
    location => '@mystage',
    file_format => 'my_parquet_format'
  )
);
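Building on that, the inferred schema can also be used to create the table without typing out the 1000+ columns, and MATCH_BY_COLUMN_NAME can then map the Parquet fields by name during the COPY. A sketch reusing the stage and file format names above (the target table name is made up):

-- Create the target table from the inferred schema
create table my_wide_table
  using template (
    select array_agg(object_construct(*))
    from table(
      infer_schema(
        location => '@mystage',
        file_format => 'my_parquet_format'
      )
    )
  );

-- Load by matching Parquet field names to column names, no column list required
copy into my_wide_table
  from @mystage
  file_format = (format_name = 'my_parquet_format')
  match_by_column_name = case_insensitive;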

How to load default columns while loading a Parquet file in Snowflake

I am loading a Parquet file into Snowflake using the COPY command. The Parquet file has 10 columns and the Snowflake target table has 12 columns (2 with default dates: create date and update date).
SQL compilation error: Insert value list does not match column list expecting 12 but got 10
Is there any way I can load default values into the Snowflake table while loading the data from a Parquet file with fewer or more columns?
Any help would be greatly appreciated.
You must specify the column list in the table specification. If you use only copy into table_name from (select column_names from @stage), then the staged file needs to match all columns present in the table.
copy into <table> (<col1>, <col2>, ... <col10>) from
(select
  $1:<col_1_parq_file>::<format_specifier>,
  $1:<col_2_parq_file>::<format_specifier>,
  $1:<col_3_parq_file>::<format_specifier>,
  ...
  $1:<col_10_parq_file>::<format_specifier>
from @<stage_with_parquet_file>);
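For the two date columns specifically, it should be enough that they carry DEFAULT expressions and are simply left out of the column list; the defaults are then applied for every loaded row. A small made-up example of that pattern (table, stage and column names are hypothetical):

-- Hypothetical 5-column target: 3 data columns plus 2 defaulted audit columns
create or replace table sales (
  id          number,
  region      varchar,
  amount      number(10,2),
  create_date timestamp_ntz default current_timestamp(),
  update_date timestamp_ntz default current_timestamp()
);

-- Only the 3 data columns are listed; create_date/update_date fall back to their defaults
copy into sales (id, region, amount)
from (select
  $1:id::number,
  $1:region::varchar,
  $1:amount::number(10,2)
from @my_stage/sales.parquet)
file_format = (type = parquet);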

SQL Update master table with new table data hourly based on no match on Composite PK

Using SQL Server 2008
I have an SSIS task that downloads a CSV file from FTP and renames the file every hour. After that I'm doing a bulk insert of the data into a new table called NEWFTPDATA.
The data in this file is for the current day up to the current hour. The table has a composite primary key consisting of 4 different columns.
The next step is, using T-SQL, to compare this new table to my existing master archive table and insert any rows that do not already exist, based on matching (or rather not matching) on those 4 columns.
Since I'll be downloading this file hourly (for real-time reporting), each subsequent run will contain duplicate data, which I do not want to insert into the master table.
I've found ways to do this based off of the existence of one particular column, but I can't seem to figure out how to do it based off of 4 columns needing to match.
The workflow should be as follows
Update MASTERTABLE from NEWFTPDATA where newftpdata.column1, newftpdata.column2, newftpdata.column3, newftpdata.column4 do not exist in MASTERTABLE
Hopefully I've supplied substantial information for this question. If any further details are required please let me know. Thank you.
You can use MERGE:
MERGE MasterTable AS dest
USING newftpdata AS src
  ON dest.column1 = src.column1
 AND dest.column2 = src.column2
 AND dest.column3 = src.column3
 AND dest.column4 = src.column4
WHEN NOT MATCHED THEN
  INSERT (column1, column2, ...)
  VALUES (src.column1, src.column2, ...);
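If only inserts are needed (no updates to existing rows), an equivalent approach on SQL Server 2008 is an INSERT guarded by NOT EXISTS on the four key columns (column names as in the question; any remaining non-key columns would be added to both lists):

INSERT INTO MasterTable (column1, column2, column3, column4)
SELECT src.column1, src.column2, src.column3, src.column4
FROM NEWFTPDATA AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM MasterTable AS dest
    WHERE dest.column1 = src.column1
      AND dest.column2 = src.column2
      AND dest.column3 = src.column3
      AND dest.column4 = src.column4
);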

Get a list of columns and widths for a specific record

I want a list of properties about a given table and for a specific record of data from that table - in one result
Something like this:
Column Name , DataLength, SchemaLengthMax
...and for only one record (based on a where filter)
So what I'm thinking is something like this:
- Get a list of columns from sys.columns and also the schema-based maxlength value
- populate column names into a temp table that includes (column_name, data_length, schema_size_max)
- now loop over that temp table and for each column name, fetch the data for that column based on a specific record, then update the temp table with the length of this data
- finally, select from the temp table
Sound reasonable?
Yup. That way works. Not sure if it's the best, since it involves one iteration per column along with the where condition on the source table.
Consider this instead:
1. Get the candidate records into a temporary table after applying the where condition. Make sure to get a primary key; if there is no primary key, get a row id (assuming SQL Server 2005 or above).
2. Create a temporary table (say, #RecValueLens) that has three columns: Primary_Key_Value, MyColumnName, MyValueLen.
3. Loop through the list of column names (after taking only the column names into another temporary table) and build the SQL statement shown in step 4.
4. Insert Into #RecValueLens (Primary_Key_Value, MyColumnName, MyValueLen)
   Select Max(Primary_Key_Goes_Here), Max('Column_Name_Goes_Here') as MyColumnName, Len(Max(Column_Name_Goes_Here)) as MyValueLen
   From Source_Table_Goes_Here
   Group By Primary_Key_Goes_Here
   So, if there are 10 columns, you will have 10 insert statements. You could either insert them into a temporary table and run them in a loop, or, if the number of columns is small, concatenate all the statements into a single batch.
5. Run the SQL statement(s) from above. Now you have record-wise, column-wise value lengths. What is left is to get the column definition.
6. Get the column definition from sys.columns into a temporary table and join it with #RecValueLens to get the output.
Do you want me to write it for you?
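A rough sketch of steps 3 to 6 for a single record, assuming SQL Server 2008+ and made-up object names (dbo.SourceTable filtered on Id = 42): it builds one DATALENGTH query per column from sys.columns, runs the batch with dynamic SQL, then joins back for the schema-defined maximum length.

DECLARE @TableName sysname = N'dbo.SourceTable';   -- hypothetical source table
DECLARE @Filter nvarchar(200) = N'Id = 42';        -- hypothetical where filter

IF OBJECT_ID('tempdb..#RecValueLens') IS NOT NULL DROP TABLE #RecValueLens;
CREATE TABLE #RecValueLens (ColumnName sysname, DataLength int);

DECLARE @sql nvarchar(max) = N'';

-- One INSERT ... SELECT DATALENGTH(col) per column, concatenated into a single batch
SELECT @sql = @sql + N'
INSERT INTO #RecValueLens (ColumnName, DataLength)
SELECT ''' + c.name + N''', DATALENGTH(' + QUOTENAME(c.name) + N')
FROM ' + @TableName + N' WHERE ' + @Filter + N';'
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(@TableName);

EXEC sys.sp_executesql @sql;

-- Join the measured lengths back to the schema-defined maximum lengths
SELECT r.ColumnName, r.DataLength, c.max_length AS SchemaLengthMax
FROM #RecValueLens AS r
JOIN sys.columns AS c
  ON c.object_id = OBJECT_ID(@TableName)
 AND c.name = r.ColumnName;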
