Loading multiple file formats together in snowpipe

We have two file formats: XML and JSON. How can we load both types of files together with Snowpipe?

You can't load two different file formats into the same VARIANT column of the same table using a single pipe.
Instead, you can create two pipes and two tables, and use a file-name pattern in each pipe to load the files into their respective table.
What are you trying to achieve using a single pipe?
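
A minimal sketch of the two-pipe setup, written as a small Python helper that only builds the DDL text. All object names (`json_pipe`, `xml_pipe`, `raw_json`, `raw_xml`, `landing_stage`) are hypothetical, and the sketch assumes each target table has a single VARIANT column and that the stage is an external stage already configured for auto-ingest notifications.

```python
def build_pipe_ddl(pipe: str, table: str, stage: str, file_type: str, pattern: str) -> str:
    """Return a CREATE PIPE statement that loads only files matching `pattern`."""
    return (
        f"CREATE PIPE {pipe} AUTO_INGEST = TRUE AS\n"
        f"  COPY INTO {table}\n"
        f"  FROM @{stage}\n"
        f"  PATTERN = '{pattern}'\n"
        f"  FILE_FORMAT = (TYPE = '{file_type}');"
    )


# Hypothetical names: one pipe and one table per file format, same stage.
PIPES = [
    build_pipe_ddl("json_pipe", "raw_json", "landing_stage", "JSON", ".*[.]json"),
    build_pipe_ddl("xml_pipe", "raw_xml", "landing_stage", "XML", ".*[.]xml"),
]

for ddl in PIPES:
    print(ddl, end="\n\n")
```

Each pipe ignores files that do not match its pattern, so both formats can land in the same stage location.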

Related

Standard for storing multiple plain-text tables (csv, tsv) in one file?

I need to store multiple tables (say, CSVs) in one file. Is there any spec or standard I can follow?
I tried queries such as "multi-csv standard" and "single file plaintext database" with little to show.
Who would be responsible for creating such a standard?
Here is one plausible option:
# TABLE colors
name,hex
red,ff0000
blue,0000ff
# TABLE users
id,name,followers
1,bob,5
2,alice,33
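
For illustration, a layout like the one above can be read with a few lines of Python. This is only a sketch under stated assumptions: the `# TABLE ` marker is the convention proposed here, not an established standard, and `parse_multi_table` is a hypothetical helper.

```python
import csv
import io


def parse_multi_table(text: str, marker: str = "# TABLE ") -> dict[str, list[dict[str, str]]]:
    """Split text into named tables; each table becomes a list of row dicts."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith(marker):
            # A marker line opens a new table named by the rest of the line.
            current = line[len(marker):].strip()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)
    # The first line of each section is treated as the CSV header.
    return {
        name: list(csv.DictReader(io.StringIO("\n".join(lines))))
        for name, lines in sections.items()
    }


SAMPLE = """\
# TABLE colors
name,hex
red,ff0000
blue,0000ff
# TABLE users
id,name,followers
1,bob,5
2,alice,33
"""

tables = parse_multi_table(SAMPLE)
print(tables["users"][1]["name"])
```

The sketch breaks if a data line itself starts with `# TABLE `, for example inside a quoted multi-line field, which is one reason an agreed escaping rule would be needed in a real spec.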
Why not a directory of tables?
An empty directory is not acknowledged by git
Hassle to send over network in one transaction (tar/untar or zip/unzip)
See also
https://github.com/jennybc/sanesheets/issues/3
https://www.google.com/search?q=Standard+for+storing+multiple+csv+in+one+file
CSV Standard - Multiple Tables
the dump format from sqlite is basically a sequence of commands: https://www.sqlitetutorial.net/sqlite-dump/
https://en.wikipedia.org/wiki/Flat-file_database
https://www.gnu.org/software/recutils/
https://dev.to/jcolag/recutils-the-plain-text-database-52ma
https://github.com/dbohdan/structured-text-tools#csv
Import multiple csv files into pandas and concatenate into one DataFrame

Parsing CEF files in snowflake

We have staged log files in an external S3 stage. The staged log files are in CEF format. How can we parse the CEF files from the stage and move the data into Snowflake?

If the files have a fixed format (i.e. there are record and field delimiters and each record has the same number of columns) then you can just treat it as a text file and create an appropriate file format.
If the file has a semi-structured format then you should be able to load it into a variant column. Whether you can create multiple rows per file or only one depends on the file structure. If you can only create one record per file then you may run into issues with file size, as a variant column has a maximum size.
Once the data is in a variant column you should be able to process it to extract usable data from it. If there is a structure Snowflake can process (e.g. XML or JSON) then you can use the native capabilities. If there is no recognisable structure then you'd have to write your own parsing logic in a stored procedure.
Alternatively, you could try to find another tool that will convert your files to an XML or JSON format, and then Snowflake can easily process those files.
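
As a rough sketch of the conversion route, the following Python turns one CEF record into a JSON object. It assumes the usual CEF layout: a `CEF:` prefix, seven pipe-delimited header fields, then space-separated `key=value` extensions. The field names in `HEADER_FIELDS` and the `parse_cef` helper are illustrative choices, not part of any Snowflake feature.

```python
import json
import re

HEADER_FIELDS = ["version", "device_vendor", "device_product", "device_version",
                 "signature_id", "name", "severity"]


def parse_cef(line: str) -> dict:
    """Parse one CEF record into a flat dict of header fields plus extensions."""
    start = line.find("CEF:")
    if start < 0:
        raise ValueError("not a CEF record")
    # Split on pipes not preceded by a backslash: 7 header fields + extension.
    parts = re.split(r"(?<!\\)\|", line[start + 4:], maxsplit=7)
    if len(parts) < 7:
        raise ValueError("incomplete CEF header")
    record = {key: value.replace("\\|", "|") for key, value in zip(HEADER_FIELDS, parts)}
    extension = parts[7] if len(parts) > 7 else ""
    # A value runs until the next " key=" token or the end of the line.
    for key, value in re.findall(r"(\w+)=(.*?)(?=\s+\w+=|\s*$)", extension):
        record[key] = value.replace("\\=", "=")
    return record


sample = ("Sep 19 08:26:10 host CEF:0|Security|threatmanager|1.0|100|"
          "worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232")
print(json.dumps(parse_cef(sample)))
```

Writing one JSON object per line produces files that can be staged and loaded into a variant column with a JSON file format. The regular expressions are simplifications: a header value ending in an escaped backslash, or an extension value that contains text resembling ` key=`, would be split incorrectly.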

How to use SSIS to compare single value in File1 header to total row count in File2 (csv)

I'm tasked with creating a step that validates a value reported by an external organization against the actual data delivered by the same organization.
This organization will drop 2 files on our server.
File1 contains a single numerical value.
File2 is a .csv with multiple rows.
I don't need to load the data anywhere; I just want to check that the value in File1 matches the total row count in File2.
Any recommendation on how to perform this?

Since the goal is to compare values between two files, there is no need to use SSIS. You can use a simple C# script to read the files using System.IO.StreamReader, store the values in two variables, and then compare them. Or just import both files into SQL and use an SQL query to compare the values.
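
The same check, sketched in Python rather than C# purely for illustration. `counts_match` is a hypothetical helper, and whether File2 has a header row that should be excluded from the count is an assumption exposed as the `has_header` parameter.

```python
import csv
from pathlib import Path


def counts_match(count_file: str, data_file: str, has_header: bool = True) -> bool:
    """Return True when the number in count_file equals the data row count."""
    expected = int(Path(count_file).read_text().strip())
    with open(data_file, newline="") as fh:
        # csv.reader counts records, so quoted fields with newlines stay one row.
        actual = sum(1 for row in csv.reader(fh) if row)
    if has_header and actual > 0:
        actual -= 1
    return expected == actual
```

A call such as `counts_match("file1.txt", "file2.csv")` (file names are placeholders) returns a boolean that a scheduling step can use to pass or fail the validation.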

SSIS: add non-existent column to a CSV source

I am loading a large set (tens of thousands) of CSV files into a single staging SQL Server table, using the standard SSIS approach.
The vast majority of the source CSV files have an identical column structure (order, set of columns, data types). There are around 140 columns altogether.
However, in certain (<1%) cases a source file will be lacking some columns (I know exactly which columns they are, and there are three possible combinations of missing columns). This is by design i.e. this is a valid business scenario (meh).
Can I somehow create a "virtual" column (filled with NULL/empty/blank values) for a source CSV connection if (and only if) that column does not exist in the physical source CSV file?
I know I can read CSV header with a C# scripting component and create multiple source connections, and re-direct to the right data flow based on existence (or lack) of certain columns but I am hoping for a more "elegant" solution, with just single CSV data source "smart" enough to "artificially" add blank columns that are missing in the source file.
For simplicity let's assume that the full column set is:
ID;C1;C2;C3
And that C3 is missing occasionally i.e. some CSV files are:
ID;C1;C2
Any hints welcome.

No, there is no "smart" CSV data source built in to SSIS.
You are certainly going to need to use a script component, but instead of using a Script Task outside the dataflow that directs the control flow to the correct dataflow, you can simply create one dataflow that has a script component as the data source. The script component reads the CSV that is currently being imported, and if the column in question is missing, it supplies it with NULL or default values.
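
The core logic such a script component would implement is small. The sketch below shows it in Python rather than in SSIS script code, using the simplified `ID;C1;C2;C3` column set from the question; `read_with_full_columns` is a hypothetical name.

```python
import csv
import io
from typing import Iterator, Optional, TextIO

FULL_COLUMNS = ["ID", "C1", "C2", "C3"]


def read_with_full_columns(
    fh: TextIO, columns: list[str] = FULL_COLUMNS, delimiter: str = ";"
) -> Iterator[dict[str, Optional[str]]]:
    """Yield every row with the full column set; absent columns become None."""
    reader = csv.DictReader(fh, delimiter=delimiter)
    for row in reader:
        # row.get returns None for any column missing from this file's header.
        yield {col: row.get(col) for col in columns}


short_file = io.StringIO("ID;C1;C2\n1;a;b\n")
for record in read_with_full_columns(short_file):
    print(record)
```

Because the output always has the same columns, everything downstream of the source can stay identical for all three missing-column combinations.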

Need to map csv file to target table dynamically

I have several CSV files and their corresponding tables in the database (each with the same columns as its CSV, with appropriate data types, and the same name as the CSV). So every CSV has a table in the database.
I somehow need to map them all dynamically. Once I run the mapping, the data from all the CSV files should be transferred to the corresponding tables. I don't want to have a different mapping for every CSV.
Is this possible through Informatica?
Appreciate your help.

PowerCenter does not provide such a feature out of the box. Unless the structures of the source files and target tables are the same, you need to define separate source/target definitions and create mappings that use them.
However, you can use Stage Mapping Generator to generate a mapping for each file automatically.

My understanding is that you have many CSV files with different column layouts and you need to load them into the appropriate tables in the database.
Approach 1: If you use any RDBMS, it should have some kind of import option. Explore that route to create tables based on the CSV files. This is a manual task.
Approach 2: Open the CSV file and write formulae that use the header to generate a CREATE TABLE statement. Execute the formula result in your DB, so you will have many tables created. Now use Informatica to read the CSVs, import all the tables, and load the data into them.
Approach 3: Using Informatica. You need to do a lot of coding to create a dynamic mapping on the fly.
Proposed solution:
Mapping 1:
1. Read the CSV file and pass the header information to a Java transformation.
2. The Java transformation should normalize and split the header columns into rows. You can write them to a text file.
3. Now you have all the columns in a text file. Read this text file and use an SQL transformation to create the tables in the database.
Mapping 2:
Now that the table is available, read the CSV file excluding the header and load the data into the table created by mapping 1 via an SQL transformation (INSERT statement).
You can follow this approach for all the CSV files. I haven't tried this solution myself, but I am sure that the above approach would work.
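
The header-to-DDL step from Approach 2 can be sketched as follows. This is Python standing in for the spreadsheet formula or Java transformation, not Informatica code; `create_table_sql` is a hypothetical helper, and typing every column as `VARCHAR(255)` is an assumption, since the header alone carries no type information.

```python
import csv
import io
import re


def create_table_sql(table: str, csv_text: str, column_type: str = "VARCHAR(255)") -> str:
    """Build a CREATE TABLE statement from the header row of csv_text."""
    header = next(csv.reader(io.StringIO(csv_text)))
    # Replace runs of non-word characters so names are plain identifiers.
    columns = [re.sub(r"\W+", "_", name.strip()) for name in header]
    body = ",\n".join(f"  {col} {column_type}" for col in columns)
    return f"CREATE TABLE {table} (\n{body}\n);"


print(create_table_sql("users", "id,name,followers\n1,bob,5\n"))
```

Column names are only lightly sanitized here; reserved words and duplicate names would still need handling before running the statement against a real database.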

If you're not using any transformations, it's wise to use the import option of the database (e.g. a BTEQ script in Teradata). But if you are doing transformations, then you have to create as many sources and targets as the number of files you have.
On the other hand you can achieve this in one mapping.
1. Create a separate flow for every file(i.e. Source-Transformation-Target) in the single mapping.
2. Use target load plan for choosing which file gets loaded first.
3. Configure the file names and corresponding database table names in the session for that mapping.

If all the mappings (if you have to create them separately) are the same, use the indirect file method. In the session properties, under the mappings tab, in the source options, you will find this setting. The default is Direct; change it to Indirect.
I don't have the tool at hand to explore further and guide you more precisely, but look into the Indirect file load type in Informatica. I am sure that it will meet the requirement.

I have written a workflow in Informatica that does it, but some of the complex steps are handled inside the database. The workflow watches a folder for new files. Once it sees all the files that constitute a feed, it starts to process the feed. It takes a backup in a time stamped folder and then copies all the data from the files in the feed into an Oracle table. An Oracle procedure gets to work and then transfers the data from the Oracle table into their corresponding destination staging tables and finally the Data Warehouse. So if I have to add a new file or a feed, I have to make changes in configuration tables only. No changes are required either to the Informatica Objects or the db objects. So the short answer is yes this is possible but it is not an out of the box feature.
