OK I'm using SQL Server 2012.
I have an 'Orders Folder' that contains CSV files. I need to loop through the CSV files, load them into a SQL table, and then move the CSVs into an 'Archive Folder'. I want to perform this in SSIS and know this is possible using a Foreach Loop container, which I understand.
The CSV file has the following headings:
Customer / Item / Qty / Date
My SQL table has the following headings:
Customer / Item / Qty / Date / User
The tricky bit is that the CSV file names contain the username for each order and look like this (USERNAME obviously changes):
FWD_Order_USERNAME_01_02_2016_1006_214.csv
I need to extract the USERNAME and append it to the SQL table for each csv file when it is imported - how do I do this?
Thanks,
Michael
You can populate a package variable with the username from the file name, and use a Derived Column transformation to add it as a new column in your dataflow.
You are using a Foreach Loop to process the files, so you must already be storing the file name in a variable.
1) Create one more variable for the user name, give it an expression that extracts the user name from the file name variable, and set EvaluateAsExpression = True on the variable (see the expression sketch below).
2) In the data flow, add a Derived Column transformation that uses this user variable and map it to the User column in the destination.
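As a rough sketch of that expression (assuming, since the post doesn't say, that the Foreach Loop writes the file name and extension into a variable called User::FileName), the built-in TOKEN function can pull out the third underscore-delimited piece:
TOKEN(@[User::FileName], "_", 3)
For FWD_Order_USERNAME_01_02_2016_1006_214.csv this evaluates to USERNAME. Adjust the variable name and token index if your loop returns the fully qualified path or if your folder names contain underscores.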
I am using Snowflake Data Load Wizard to upload csv file to Snowflake table. The Snowflake table structure identifies a few columns as 'NOT NULL' (non-nullable). Problem is, the wizard is treating empty strings as null and the Data Load Wizard issues the following error:
Unable to copy files into table. NULL result in a non-nullable column
File '#<...../load_data.csv', line 2, character 1 Row, 1 Column
"<TABLE_NAME>" ["PRIMARY_CONTACT_ROLE":19)]
I'm sharing my File Format parameters from the wizard:
I then updated the DDL of the table by removing the "NOT NULL" declaration of the PRIMARY_CONTACT_ROLE column and re-created the table, and this time the data load of 20K records was successful.
How do we fix the file format wizard to make SNOWFLAKE not consider empty strings as NULLS?
The option you need to set is EMPTY_FIELD_AS_NULL = FALSE. Unfortunately, modifying this option is not possible in the wizard. You have to create your file format, or alter your existing file format, in a worksheet manually as follows:
CREATE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
EMPTY_FIELD_AS_NULL = FALSE;
This will cause empty strings to be treated as empty strings rather than as NULL values.
The relevant documentation can be found at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#type-csv.
Let me know if you need a full example of how to upload a CSV file with the SnowSQL CLI.
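For reference, a minimal sketch of what that could look like, assuming a table named my_table, an internal stage named my_stage, and a local file /tmp/load_data.csv (all placeholders, not from the original post):
-- if the file format already exists, just flip the option
ALTER FILE FORMAT my_csv_format SET EMPTY_FIELD_AS_NULL = FALSE;
-- from SnowSQL: upload the file, then load it using the file format
PUT file:///tmp/load_data.csv @my_stage;
COPY INTO my_table
FROM @my_stage/load_data.csv
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');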
Just to add, there are additional ways you can load your CSV into Snowflake without having to specify a file format.
You can use pre-built third-party modeling tools to upload your raw file, adjust the default column types to your preference and push your modeled data back into snowflake.
My team and I are working on such a tool, Datameer; feel free to check it out here:
https://www.datameer.com/upload-csv-to-snowflake/
I have a container with partitioned parquet files that I want to use with the copy into command. My directories look like the below.
ABC_PARTITIONED_ID=1 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
4-snappy.parquet
ABC_PARTITIONED_ID=2 (directory)
1-snappy.parquet
2-snappy.parquet
3-snappy.parquet
ABC_PARTITIONED_ID=3 (directory)
1-snappy.parquet
2-snappy.parquet
....
Each partitioned directory can contain multiple parquet files. I do not have a hive partition column that matches the pattern of the directories (ID1, ID2 etc).
How do I properly use the pattern parameter in the COPY INTO command to load a Snowflake table from my ADLS container? I am using this https://www.snowflake.com/blog/how-to-load-terabytes-into-snowflake-speeds-feeds-and-techniques/ as an example.
I do not think you need to do anything with the pattern parameter.
You said you do not have a hive partition column that matches the pattern of the directories. If you do not have a column that uses these partitions, then they are probably not beneficial for querying the data; maybe they were generated to help with maintenance. If this is the case, ignore the partitions and read all files with the COPY command.
If you think having such a column would help, then the blog post you mentioned already shows how you can parse the file names to generate the column value. Add the partition column to your table (you may even define it as the clustering key), and run the COPY command to read all files in all partitions/directories, parsing the value of the column from the file name (see the DDL sketch after the COPY example below).
For parsing the partition value, I would use this, which seems easier:
copy into TARGET_TABLE from (
  select
    REGEXP_SUBSTR(
      METADATA$FILENAME,
      '.*\/ABC_PARTITIONED_ID=(.*)\/.*',
      1, 1, 'e', 1
    ) partitioned_column_value,
    $1:column_name,
    ...
  from @your_stage/data_folder/
);
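If you do add the column, a rough sketch of the DDL side (assuming the target table is TARGET_TABLE and the new column is PARTITIONED_COLUMN_VALUE, both placeholder names):
ALTER TABLE TARGET_TABLE ADD COLUMN PARTITIONED_COLUMN_VALUE VARCHAR;
-- optionally make it the clustering key
ALTER TABLE TARGET_TABLE CLUSTER BY (PARTITIONED_COLUMN_VALUE);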
If the directory/partition name doesn't matter to you, then you can use some of the newer functions in public preview that support the Parquet format to create the table and ingest the data. As for how to construct the pattern, PATTERN = '.*.parquet' will do, since all subfolders under the stage path are read.
//create file format, only required one time
create file format my_parquet_format
  type = parquet;
//EXAMPLE CREATE AND COPY INTO FOR TABLE ABC
//create an empty table named ABC using this file format and location
create or replace table ABC
  using template (
    select array_agg(object_construct(*))
    from table(
      infer_schema(
        location=>'@mystage/ABC_PARTITIONED_ROOT',
        file_format=>'my_parquet_format'
      )
    ));
//copy the parquet files from all partition subfolders under ABC_PARTITIONED_ROOT into table ABC
copy into ABC from @mystage/ABC_PARTITIONED_ROOT pattern = '.*.parquet' file_format = my_parquet_format match_by_column_name = case_insensitive;
This should be possible by creating a storage integration, granting Snowflake access to the storage location in Azure, and then creating an external stage.
Alternatively you can generate a shared access signature (SAS) token to grant Snowflake (limited) access to objects in your storage account. You can then access an external (Azure) stage that references the container using the SAS token.
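A rough sketch of the storage integration route (the account, container, and tenant values are placeholders, not taken from the question):
create storage integration azure_int
  type = external_stage
  storage_provider = 'AZURE'
  enabled = true
  azure_tenant_id = '<tenant_id>'
  storage_allowed_locations = ('azure://<account>.blob.core.windows.net/<container>/');
create stage my_adls_stage
  url = 'azure://<account>.blob.core.windows.net/<container>/'
  storage_integration = azure_int
  file_format = my_parquet_format;
The SAS token route is similar, except the stage is created with credentials = (azure_sas_token = '...') instead of a storage integration.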
Snowflake metadata provides
METADATA$FILENAME - Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER - Row number for each record
We could do something like this:
select $1:normal_column_1, ..., METADATA$FILENAME
from '@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
For example, it would give something like:
METADATA$FILENAME
----------
path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet
We need to handle deriving the partition column from it. We could do a regexp_replace and get the partition value as a column like this:
select
  regexp_replace(METADATA$FILENAME, '.*\/year=(.*)\/.*', '\\1') as year,
  $1:normal_column_1
from '@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
In the above regexp we match on the partition key (year=).
The third parameter, '\\1', is the regex group reference; in our case the first group match holds the partition value.
A more detailed answer and other approaches to this issue are available in this Stack Overflow answer.
I am dealing with JSON and CSV files moving from a Unix server and an S3 bucket to an internal and an external stage respectively,
and I don't have any issue copying the JSON files from the internal/external stages into a static or logical table, where I store them as JsonFileName and JsonFileContent.
Copying to a static table (parse_json($1) is working for JSON):
COPY INTO LogicalTable (FILE_NM, JSON_CONTENT)
from (
select METADATA$FILENAME AS FILE_NM, parse_json($1) AS JSON_CONTENT
from @$TSJsonExtStgName
)
file_format = (type='JSON' strip_outer_array = true);
I am looking for something similar for CSV: copy the CSV file name and the CSV file content from the internal/external stage into a static or logical table. I mainly want this in order to separate the file copy from the file load, since the load may fail due to a column count mismatch, a newline character, or bad data in one of the records.
If either of the options below can be made to work, that is fine; please suggest.
1) Trying to copy to a static table (METADATA$?????? is not working for CSV)
select METADATA$FILENAME AS FILE_NM, METADATA$?????? AS CSV_CONTENT
from @INT_REF_CSV_UNIX_STG
2) Trying for dynamic columns (T.* not working for CSV)
SELECT METADATA$FILENAME, $1, $2, $3, T.*
FROM @INT_REF_CSV_UNIX_STG (FILE_FORMAT => CSV_STG_FILE_FORMAT) T
Regardless of whether the file is CSV or JSON, you need to make sure that your SELECT matches the layout of the target table. I assume that for your JSON the target table has two columns: the file name and a VARIANT column for the JSON contents. For CSV you need to do the same thing, so select $1, $2, etc. for each column that you want from the file, matching your target table (see the sketch below).
I have no idea what you are referencing with METADATA$??????, by the way.
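As a minimal sketch, assuming a three-column CSV and a target table with a file-name column plus three matching columns (the table and column names here are placeholders):
COPY INTO CsvLogicalTable (FILE_NM, COL1, COL2, COL3)
from (
  select METADATA$FILENAME, $1, $2, $3
  from @INT_REF_CSV_UNIX_STG
)
file_format = (format_name = 'CSV_STG_FILE_FORMAT');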
---ADDED
Based on your comment below, you have 2 options, neither of which is native to a COPY INTO statement:
1) Create a stored procedure that reads the table DDL, generates a COPY INTO statement with the static columns defined, and then executes the COPY INTO from within the SP.
2) Leverage an external table. By defining an external table with METADATA$FILENAME and the rest of the columns, the external table will return the CSV contents to you as JSON (see the sketch below). From there, you can treat it in the same way you are treating your JSON tables.
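A rough sketch of that external table, assuming an external stage (external tables cannot be built over internal stages) and a three-column CSV; the table, stage, and column names are placeholders:
create or replace external table EXT_CSV_REF (
  file_nm varchar as (METADATA$FILENAME),
  col1 varchar as (value:c1::varchar),
  col2 varchar as (value:c2::varchar),
  col3 varchar as (value:c3::varchar)
)
location = @MY_EXT_CSV_STG
file_format = (type = csv skip_header = 1)
auto_refresh = false;
Each row also exposes the full record in the VALUE column as a JSON object ({"c1": ..., "c2": ..., "c3": ...}), which is what lets you treat CSV content the same way as your JSON tables.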
Hi, I have a question about SSIS.
I want to load multiple CSV files into a SQL Server table using an SSIS package.
While loading, we need to consider only the data from the header row onwards.
The source path has 3 CSV files with fixed header columns and data,
but each file has a file description and creation dates before the headers, and the position varies:
in one file the description is on row 2 and the header row with data starts on row 4, in another the description is on row 1 and the headers with data start on row 9, and in the third the description is on row 5 and the header row starts on row 7. The column headers are fixed in all the CSV files.
Files location :
C:\test\a.csv
C:\test\b.csv
C:\test\c.csv
a.csv file data looks like below:
(here the description and dates are on rows 2 and 3; actual data starts from row 4 onwards)
descritiion:empinfromationforhydlocation
creadeddate:2018-04-20
id |name|loc
1 |a |hyd
b.csv file data looks like below:
(here the description and dates are on rows 1 and 2; actual data starts from row 9 onwards)
descritiion:empinfromationforhydlocation
creadeddate:2018-04-21
id |name|loc
10 |b |chen
c.csv file data looks like below:
(here the description and dates are on rows 5 and 6; actual data starts from row 9 onwards)
descritiion:empinfromationforhydlocation
creadeddate:2018-04-21
id |name|loc
20 |c |bang
Based on the above 3 files, I want to load the data into the target SQL Server table emp:
id |name|loc
1 |a |hyd
2 |b |chen
3 |c |bang
Here is what I tried on the package side.
Create variables:
filelocationpath: C:\test\
filename: C:\test\a.csv
Drag and drop a Foreach Loop container:
choose the enumerator type: Foreach File Enumerator
directory: C:\test
variable mappings: map the filename variable
type of file: *.csv
retrieve file name: file name and extension
Inside the Foreach Loop container I dropped a Data Flow task
and created a flat file connection configured against one of the files, with header rows to skip set to 1; I used a Data Conversion for the required columns, configured an OLE DB destination table, and created a dynamic connection string expression for the flat file connection to pass the file name dynamically.
After executing the package, the 2nd file fails because of the description and date rows:
the description and dates do not come on fixed rows in the next day's files;
they can appear on different rows.
Is it possible to determine dynamically how many rows to skip and pass that count into the header rows to skip setting? Is this possible in SSIS?
Please tell me how to achieve this task in SSIS.
If the number of rows to skip is constant, search YouTube for this video: Delete Top N Rows from Flat File in SSIS Package.
If you still need to determine that number and you don't know it in advance, write the count of useless rows into a variable and then pass that value to the package for processing.
Workaround
In the Flat File connection manager, uncheck the option to read the header from the first row, then go to the Advanced tab and define the column metadata manually (column name, length, ...).
Within the Data Flow task, add a Script Component.
In the Script Component editor, go to the Inputs and Outputs tab and add an output column of type Boolean.
In the script, keep checking whether the first column's value is equal to the column header; while this condition is not met, set the output column to False, and once the value equals the column header, set the output column to True for all remaining rows.
After the Script Component, add a Conditional Split to filter rows based on the generated column value (rows with a False value must be ignored).
Create a new file connection with a single column for the same file.
Add a Data Flow task with a transformation Script Component.
Attach a read-write variable to the Script Component as an index (skiprows in the example code) and check the first characters of each row in the process-input-row method.
bool checkRow;
int rowCount;

public override void PreExecute()
{
    base.PreExecute();
    checkRow = true;
    rowCount = 0;
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Count rows until the header row is reached
    if (checkRow)
    {
        rowCount++;
        if (Row.Data.StartsWith("id |"))
            checkRow = false;
    }
}

public override void PostExecute()
{
    base.PostExecute();
    // Read-write variables can only be written in PostExecute
    Variables.skiprows = rowCount; // set the script variable
}
Then you just need to use that variable in the HeaderRowsToSkip property expression of the original flat file connection (for example, @[User::skiprows]).
If the files are going to be very large, you can force the script to fail as soon as the header row has been found (a zero division, for example) so the rest of the file is not read. Add an error event handler and set the system variable Propagate to false (@[System::Propagate] = false).
I am using SSDT and working on a simple SSIS package.
The Control flow:
1. A Foreach Loop Container that checks whether an "importdata{}.csv" file exists in a folder or not.
2. If found, a script task will set variables:
- User::FullPath = (e.g C:\importdata{}.csv)
- User::varFileNameNoExt = (importdata{}) without extension.
The {} can be "toy", "game", or "food".
3. Go to the data flow.
The Data Flow:
1. Flat File Source with a flat file connection; the connection string is a variable and is mapped through a connection string expression.
2. ADO.NET Destination, which inserts the data.
My question is how can i set the ADO.NET Destination [TableOrViewName] Property in variable?
Assume the tables importdatatoy, importdatagame and importdatafood are created on SQL Server.
I tried to set it as "dbo"."[User::varFileNameNoExt]", but it cannot resolve the table name at runtime.
The ADO.NET Destination [TableOrViewName] property can be parameterized at the Data Flow level: in the Data Flow task's Expressions property you can select the exposed [ADO NET Destination].[TableOrViewName] property and assign your variable.
Also include the quotes when assigning the value to the variable,
e.g. varFileNameNoExt = "dbo"."tableName"
But first you will need to create the mapping with an existing table.
Can you post your error message? I'm thinking you won't be able to combine static text and a variable like that inside of the TableOrViewName field. Instead do the combination in a new [User::varTableName] SSIS variable and use the Advanced Properties Expression editor to set the TableOrViewName to this new SSIS variable. Have a look here.
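As a rough sketch, and consistent with the quoting shown in the previous answer, the expression on that new [User::varTableName] variable (with EvaluateAsExpression = True) could be:
"\"dbo\".\"" + @[User::varFileNameNoExt] + "\""
which evaluates to "dbo"."importdatatoy", "dbo"."importdatagame" or "dbo"."importdatafood" depending on the file currently being processed.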