Parsing CEF files in Snowflake - snowflake-cloud-data-platform

We have staged the log files in an external S3 stage. The staged log files are in CEF format. How can we parse the CEF files from the stage and load the data into Snowflake?

If the files have a fixed format (i.e. there are record and field delimiters, and each record has the same number of columns) then you can just treat them as text files and create an appropriate file format.
If the files have a semi-structured format then you should be able to load them into a variant column - whether you can create multiple rows per file or only one depends on the file structure. If you can only create one record per file then you may run into issues with file size, as a variant column has a maximum size.
Once the data is in a variant column you should be able to process it to extract usable data. If there is a structure Snowflake can process (e.g. XML or JSON) then you can use the native capabilities. If there is no recognisable structure then you'd have to write your own parsing logic in a stored procedure.
Alternatively, you could try to find another tool that converts your files to an XML/JSON format; Snowflake can then easily process those files.
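If you go the pre-conversion route, CEF is regular enough to convert to JSON yourself before loading. A minimal sketch in Python, assuming the standard CEF layout (a pipe-delimited header of seven fields followed by a space-separated key=value extension; escaped pipes and multi-valued extensions are not handled here):

```python
import re

def parse_cef(line):
    """Parse a single CEF record into a dict of header fields plus extensions."""
    # CEF header: CEF:Version|Vendor|Product|DeviceVersion|SignatureID|Name|Severity|Extension
    if not line.startswith("CEF:"):
        raise ValueError("not a CEF record")
    parts = line[4:].split("|", 7)
    keys = ["version", "vendor", "product", "device_version",
            "signature_id", "name", "severity"]
    record = dict(zip(keys, parts[:7]))
    # Extension: key=value pairs; values may contain spaces, so split on
    # tokens that look like the start of the next key.
    ext = parts[7] if len(parts) > 7 else ""
    record["extensions"] = dict(re.findall(r"(\w+)=(.*?)(?=\s+\w+=|$)", ext))
    return record
```

Serialising each record with `json.dumps` then gives you files Snowflake can load straight into a variant column and query with its native JSON functions.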

Related

Processing Dependent files in Apache Camel

I have two files: one contains images of scanned documents saved in a single X.img file, and the other is a metadata file saved as X.xml, which contains the image length, the image data offset, and other data related to the document.
I have to read the XML file first to get the length and offset values that tell me where to read the image from in the img file. The XML files are small data files, but my image files are large.
I am using Camel to consume files from a remote server, and I need to process them only if both the respective XML and img files are available. The XML and img files will have the same name, e.g. if my XML file is 27092018.xml then the respective image file name is 27092018.img.
How can I achieve this using Camel FTP?
There are several ways you can do this. I would suggest you look at the Claim Check EIP which will help you achieve this.
In short the following steps will need to happen.
Read the XML file and store the relevant information in some sort of persistent storage, such as a database table, a NoSQL object, or a cache like Hazelcast/DataGrid. The data would be the XML file name and the image offset, size and other information you mentioned. The XML file name (27092018.xml) can be used as the key if it is unique. This is the claim check you are storing in the database.
Later, when you read the image file, you can use this name with a Content Enricher to get the claim check out of the database and process the image file. Essentially you use the file name to look up the information in the database.
See this Wiki for more information around this pattern.

SSIS - Ultra large file (500MB) fails with OutOfMemory exception

I have an SSIS package which loads data from a non-standard XML file into a database table with an xml datatype column. I call it a non-standard file because it has some invalid characters in it, like tabs (which I remove using a script task), and it has hierarchy tags which can be present for some keys and not for others. I tried using XSLT, but it did not work, as all the attributes (tags) appear as separate outputs in the SSIS XML source rather than a single output. So I read the whole XML file as a single column and single row, as a flat file. The package runs fine when loading small files (up to 8 MB) but fails when the size is large. When trying to load a 500 MB file, the script task failed with an OutOfMemory error, so the file was sent in smaller chunks. Now the script task works for a 90 MB file but fails in the DFT, because SSIS only reads part of the XML and not the whole file, which makes the DFT fail at the destination. I adjusted MaxBufferRows to 1 and DefaultBufferSize to 100 MB from the defaults of 10,000 rows and 10 MB respectively. I found that the flat file source is reading 8193 KB of data (8388609 characters).
Please advise.
Note : I am running the SSIS package from Citrix. I am storing the whole xml document in a table and then using .nodes to extract the relevant information to be stored in relevant stage tables.
In your Script task, open a StreamReader to process the file incrementally, and stream it into an nvarchar(max) or XML column in SQL Server. There is no need to load the whole thing in SSIS.
See SqlClient Streaming Support
Edit your question to include a pared-down file, including the transformations you need to make if you need an example.

Need to map csv file to target table dynamically

I have several CSV files and have their corresponding tables (which will have same columns as that of CSVs with appropriate datatype) in the database with the same name as the CSV. So, every CSV will have a table in the database.
I somehow need to map all of those dynamically. Once I run the mapping, the data from all the CSV files should be transferred to the corresponding tables. I don't want a different mapping for every CSV.
Is this possible through informatica?
Appreciate your help.
PowerCenter does not provide such feature out-of-the-box. Unless the structures of the source files and target tables are the same, you need to define separate source/target definitions and create mappings that use them.
However, you can use Stage Mapping Generator to generate a mapping for each file automatically.
My understanding is that you have many CSV files with different column layouts and you need to load them into the appropriate tables in the database.
Approach 1: If you use any RDBMS, you should have some kind of import option. Explore that route to create tables based on the CSV files. This is a manual task.
Approach 2: Open the CSV file and write a formula using the header to generate a create table statement. Execute the formula's result in your DB, so you will have many tables created. Then use Informatica to read the CSVs, import all the tables, and load the data into them.
Approach 3: Using Informatica alone. You need to do a lot of coding to create a dynamic mapping on the fly.
Proposed Solution :
Mapping 1:
1. Read the CSV file and pass the header information to a Java transformation.
2. The Java transformation should normalize and split the header columns into rows; you can write them to a text file.
3. Now you have all the columns in a text file. Read this text file and use a SQL transformation to create the tables in the database.
Mapping 2:
Now that the table is available, read the CSV file excluding the header and load the data into the table created by mapping 1, via a SQL transformation (insert statement).
You can follow this approach for all the CSV files. I haven't tried this solution at my end, but I am sure the above approach would work.
If you're not using any transformations, it's wise to use the import option of the database (e.g. a bteq script in Teradata). But if you are doing transformations, then you have to create as many sources and targets as the number of files you have.
On the other hand you can achieve this in one mapping.
1. Create a separate flow for every file (i.e. Source-Transformation-Target) in the single mapping.
2. Use target load plan for choosing which file gets loaded first.
3. Configure the file names and corresponding database table names in the session for that mapping.
If all the mappings (if you have to create them separately) are the same, use the Indirect file method. In the session properties, under the Mappings tab, source options, you will get this option. The default is Direct; change it to Indirect.
I don't have the tool now to explore more and guide you clearly, but look into the Indirect file load type in Informatica. I am sure this will solve the requirement.
I have written a workflow in Informatica that does it, but some of the complex steps are handled inside the database. The workflow watches a folder for new files. Once it sees all the files that constitute a feed, it starts to process the feed. It takes a backup in a time stamped folder and then copies all the data from the files in the feed into an Oracle table. An Oracle procedure gets to work and then transfers the data from the Oracle table into their corresponding destination staging tables and finally the Data Warehouse. So if I have to add a new file or a feed, I have to make changes in configuration tables only. No changes are required either to the Informatica Objects or the db objects. So the short answer is yes this is possible but it is not an out of the box feature.

SSIS using too much memory to load large (40GB+) XML file into SQL Server table

I need to load a single large (40GB+) XML file into an SQL Server 2012 database table using SSIS. I'm having problems because SSIS seems to be trying to load the entire document in memory instead of streaming it.
Here's more details of my SSIS package.
I've created an XML Source with the following properties:
Data access mode: XML file from variable (but could be XML File Location)
Variable name: variable that specifies the XML file path in my computer.
XSD location: the path to the XSD that defines the XML being read.
The XML structure is simple, with only 3 hierarchical levels:
Root element with header information
One level defining collections of objects
The leaf level defining individual objects (each with a fixed set of fields)
I need to insert one database record per leaf element, repeating fields from the higher hierarchy levels. In other words, I need to flatten the XML hierarchy.
How can I make SSIS stream load the data, instead of trying to load the entire document in memory?
The XML source always loads the entire file. It uses XmlDocument to do so (last I checked).
The only thing you can do, is to split up the file somehow, then iteratively run each piece through your data flow.
Beyond that, you're looking at creating a custom data source, which is not trivial. It also represents a serious piece of code to maintain.
There may be third-party data sources which can do this. I had to write my own about five years ago.
Have you considered processing the files in smaller chunks?
I had the same issue before, so I created a script component to process that one big XML file into hundreds of smaller XML files, then a for-loop to iterate over all of the smaller XML files and process them.
To do this you can't use StreamReader.ReadLine, because it will still do the same thing and load that very large file; instead use System.IO.MemoryMappedFiles, a class designed for this scenario.
Have a look here: http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx

SQL2008 Integration Services - Loading CSV files with varying file schema

I'm using SQL2008 to load sensor data in a table with Integration Services. I have to deal with hundreds of files. The problem is that the CSV files all have slightly different schemas. Each file can have a maximum of 20 data fields. All data files have these fields in common. Some files have all the fields others have some of the fields. In addition, the order of the fields can vary.
Here’s an example of what the file schemas look like.
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,RD_1,SH_1,CL_2
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,WS_1,WD_1,WSM_1,WDM_1,SH_1
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,RS_1,RI_1,PR_1,RD_1,WS_1,WD_1,WSM_1,WDM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,PW_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,WS_1,WD_1,WSM_1
I’m using a Data Flow Script Task to process the data via CreateNewOutputRows() and MyOutputBuffer.AddRow(). I have a working package that loads the data, but it’s not reliable or robust, because as I add more files the package fails when a file's schema has not been defined in CreateNewOutputRows().
I'm looking for a dynamic solution that can cope with the variation in the file schemas. Does anyone have any ideas?
Who controls the data model for the output of the sensors? If it's not you, do they know what they are doing? If they create new and inconsistent models every time they invent a new sensor, you are pretty much up the creek.
If you can influence or control the evolution of the schemas for CSV files, try to come up with a top level data architecture. In the bad old days before there were databases, files made up of records often had, as the first field of each record, a "record type". CSV files could be organized the same way. The first field of every record could indicate what type of record you are dealing with. When you get an unknown type, put it in the "bad input file" until you can maintain your software.
If that isn't dynamic enough for you, you may have to consider artificial intelligence, or looking for a different job.
Maybe the command line is an option: you can import CSV files into SQL Server from the command line (e.g. with bcp or BULK INSERT).
If the CSV files that all have identical formats use the same file name convention or if they can be separated out in some fashion you can use the ForEach Loop Container for each file schema type.
A possible way to separate out the CSV files is to run a script (in VB) in SSIS that reads the first row of each CSV file, checks for the differing types (if the column names are in the first row), and then moves the files to the appropriate folder for use in the ForEach Loop Container.