Processing dependent files in Apache Camel

I have two files: one contains images of scanned documents saved in a single X.img file, and the other is a metadata file saved as X.xml which contains the image length, the image data offset, and other data related to the document.
I have to read the XML file first to get the image length and the offset from which to read the image out of the .img file. The XML metadata files are small, but the image files are large.
I am using Camel to consume files from a remote server, and I need to process a document only if both the corresponding .xml and .img files are available. The XML and image files share the same name: for example, if my XML file is 27092018.xml, the corresponding image file is 27092018.img.
How can I achieve this using the Camel FTP component?

There are several ways you can do this. I would suggest you look at the Claim Check EIP, which will help you achieve this.
In short, the following steps need to happen.
Read the XML file and store the relevant information in some sort of persistent storage, such as a database table, a NoSQL object, or a cache like Hazelcast/DataGrid. The data would be the XML file name and the image offset, size and other information you mentioned. The XML file name (27092018.xml) can be used as the key if it is unique. This is the claim check you are storing in the database.
Later, when you read the image file, you can use this name with a content enricher to get the claim check out of the database and process the image file. Essentially you use the file name to look up the information in the database.
See this Wiki for more information around this pattern.
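A minimal sketch of the idea in the Java DSL follows, with a ConcurrentHashMap standing in for the persistent claim-check store (a database table or Hazelcast map in a real deployment). The FTP URIs, XPath expressions and header names are assumptions and would need to match your server and XML layout; camel-ftp and camel-xpath need to be on the classpath.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.camel.builder.RouteBuilder;

    public class ScanDocClaimCheckRoutes extends RouteBuilder {

        // Stand-in for the persistent claim-check store described above.
        private final Map<String, long[]> claimChecks = new ConcurrentHashMap<>();

        @Override
        public void configure() {
            // Route 1: consume the small .xml metadata files and store the claim check,
            // keyed by the shared base name (e.g. "27092018").
            from("ftp://user@remotehost/scans?include=.*\\.xml&move=.done")
                .setHeader("imageOffset").xpath("/document/image/@offset", String.class)
                .setHeader("imageLength").xpath("/document/image/@length", String.class)
                .process(exchange -> {
                    String base = exchange.getIn().getHeader("CamelFileNameOnly", String.class)
                                          .replaceFirst("\\.xml$", "");
                    long offset = Long.parseLong(exchange.getIn().getHeader("imageOffset", String.class));
                    long length = Long.parseLong(exchange.getIn().getHeader("imageLength", String.class));
                    claimChecks.put(base, new long[]{offset, length});
                });

            // Route 2: consume the matching .img files and enrich them with the stored metadata.
            from("ftp://user@remotehost/scans?include=.*\\.img&move=.done")
                .process(exchange -> {
                    String base = exchange.getIn().getHeader("CamelFileNameOnly", String.class)
                                          .replaceFirst("\\.img$", "");
                    long[] meta = claimChecks.remove(base);
                    if (meta == null) {
                        // The .xml has not arrived yet: fail so the .img stays in place
                        // and is picked up again on a later poll (or park it somewhere).
                        throw new IllegalStateException("No metadata yet for " + base);
                    }
                    byte[] img = exchange.getIn().getBody(byte[].class);
                    // ... read meta[1] bytes starting at offset meta[0] from img and process the scan ...
                });
        }
    }

With a real database or data grid behind the lookup, the two routes can also run in separate Camel instances, which is the point of the claim-check pattern.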

Related

How to parse a single-line JSON file in a Snowflake external table

This is the first time I have tried to load a single-line JSON file into Snowflake via an external table.
The files are around 60MB and stored on S3. They contain nested records and arrays, and no newline characters.
Right now I cannot see the data from the external table. However, if the file is small enough, around 1MB, the external table works fine.
The closest solution I can find is this, but it doesn't give me a sufficient answer.
The files can be much bigger, and I have no control over them.
Is there a way to fix this issue?
Thanks!
Edit:
It looks like the only tangible solution is to make the file smaller, as the JSON file is not NDJSON. See: What JSON format does STRIP_OUTER_ARRAY support?
If your nested records are wrapped in an outer array, you can potentially use STRIP_OUTER_ARRAY on the stage file definition to remove it, so that each JSON element within the outer array gets loaded as a record. You can use METADATA$FILENAME in the table definition to derive which file each element came from.
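As a rough sketch of what that could look like (the stage, file format and table names are placeholders, someField is a made-up element name, and it assumes the file really is one large outer array):

    -- Placeholder names: @my_stage, my_json_fmt, big_json_ext.
    CREATE OR REPLACE FILE FORMAT my_json_fmt
      TYPE = 'JSON'
      STRIP_OUTER_ARRAY = TRUE;

    CREATE OR REPLACE EXTERNAL TABLE big_json_ext (
      src_file VARCHAR AS (METADATA$FILENAME)  -- which staged file the row came from
    )
    LOCATION = @my_stage/json/
    FILE_FORMAT = (FORMAT_NAME = 'my_json_fmt');

    -- Each element of the outer array becomes one row; VALUE holds it as a VARIANT.
    SELECT src_file, value:someField::STRING
    FROM big_json_ext;

If a single element (or the whole document, when there is no outer array) still exceeds the per-value VARIANT size limit, the only remaining option is the one noted in the edit above: make the files smaller.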

Parsing CEF files in Snowflake

We have staged the log files in an external S3 stage. The staged log files are in CEF format. How do we parse the CEF files from the stage to move the data into Snowflake?
If the files have a fixed format (i.e. there are record and field delimiters and each record has the same number of columns) then you can just treat them as text files and create an appropriate file format.
If the file has a semi-structured format then you should be able to load it into a variant column - whether you can create multiple rows per file or only one depends on the file structure. If you can only create one record per file then you may run into issues with file size, as a variant column has a maximum size.
Once the data is in a variant column you should be able to process it to extract usable data from it. If there is a structure Snowflake can process (e.g. XML or JSON) then you can use the native capabilities. If there is no recognisable structure then you'd have to write your own parsing logic in a stored procedure.
Alternatively, you could try to find another tool that will convert your files to an XML/JSON format, which Snowflake can then easily process.
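As a rough sketch of the "treat it as text and parse it yourself" option (the stage and table names are placeholders, and it assumes each staged line is a bare CEF record with no escaped pipes in the header fields):

    -- Load each CEF line as a single VARCHAR column.
    CREATE OR REPLACE FILE FORMAT cef_line_fmt
      TYPE = 'CSV'
      FIELD_DELIMITER = NONE;

    CREATE OR REPLACE TABLE cef_raw (line VARCHAR);

    COPY INTO cef_raw
    FROM @log_stage
    FILE_FORMAT = (FORMAT_NAME = 'cef_line_fmt');

    -- The CEF header is pipe-delimited; the key=value extension stays in one column
    -- for further parsing (REGEXP_SUBSTR, a stored procedure, etc.).
    SELECT
      SPLIT_PART(line, '|', 2) AS device_vendor,
      SPLIT_PART(line, '|', 3) AS device_product,
      SPLIT_PART(line, '|', 6) AS event_name,
      SPLIT_PART(line, '|', 7) AS severity,
      SPLIT_PART(line, '|', 8) AS extension
    FROM cef_raw;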

Variable is not updated in SSIS For Each Loop

I'm trying to create a simple project in which I'd like to download XML files from a given website. I have stored the file names in a database table. Following this tutorial: Implementing Foreach Looping Logic in SSIS, I have done the following:
a. Read all distinct rows from my table (let's call it XMLTable)
b. Assigned the result of this query to a user variable called nameOfFileToDownload
c. Created a For Each Loop container
d. Configured it to locally assign each row's file name (the file to download) to the nameFileForeachLoop variable
e. Downloaded files from the concatenated link, used as a path, with HTTPManager and a dynamic file name taken from the nameFileForeachLoop variable
f. Created an XMLFlatFile connection for a dummy file - which I assumed was needed after reading the above tutorial
The problem is that the loop container runs but doesn't download the files separately - it still writes to one file, which ends up empty. My nameFileForeachLoop variable is not updated during each loop iteration. What's more, I have noticed that during flat file creation only the CSV and TXT extensions are available. I have tried many approaches but without results. Can you help me download the XML files?
For example, I have the following link to an XML file: nbp.pl/kursy/xml/c001z180102.xml. What changes is the last part of this link (the XML file name), which I get from my XMLTable.
I have configured my components as follows:
You are on the right track, but you need some amendments.
Do not create and configure a Flat File Destination connection manager unless you are writing tables to .CSV or .TXT files. In the provided example the author selects data with dynamic queries and stores the results in dynamic .txt files. As I understand it, this is not your case.
Here are some examples of how to download and save files over HTTP in SSIS: a sample download script and a review of different approaches to HTTP download.
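The linked examples do the download inside an SSIS script task. Purely as a language-neutral illustration of the loop the package is trying to reproduce (one HTTP request per file name, each written to its own output file), here is a hedged Java sketch; the file-name list, the base URL (taken from the question's example) and the downloads directory are placeholders.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class XmlDownloadLoop {
        public static void main(String[] args) throws Exception {
            // Stand-in for the rows read from XMLTable; in the SSIS package this is the
            // result set the For Each Loop shreds into nameFileForeachLoop each iteration.
            List<String> fileNames = List.of("c001z180102.xml"); // add the other names here

            for (String name : fileNames) {
                URL url = new URL("http://nbp.pl/kursy/xml/" + name); // base URL from the question
                Path target = Paths.get("downloads", name);           // one output file per iteration
                Files.createDirectories(target.getParent());
                try (InputStream in = url.openStream()) {
                    Files.copy(in, target); // throws if the file already exists
                }
            }
        }
    }

The key point, whatever the tool, is that both the URL and the output file name are rebuilt from the loop variable on every iteration.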

How to step through an Excel (xlsx) file uploaded to a BLOB field in MSSQL

I need to get values from a certain column in an xlsx spreadsheet that was uploaded to my database in an image (BLOB) field. I would like to step through the rows, get the values from, say, column 4, and insert them into another table using SQL Server. I can do it with CSV files by casting the image field to varbinary, casting that to varchar, and searching for the commas.
Can OPENROWSET work on a BLOB field?
I doubt that this can work. Even though the data in the XLSX is stored in Microsoft's Office Open XML format (http://en.wikipedia.org/wiki/Office_Open_XML), the XML is then zipped, which means that your XLSX file is a binary file. So if you want to access data in the xlsx (can't you use CSV instead?) I think you need to do so programmatically. Depending on the programming language of your choice, there are various open-source projects that allow you to access xlsx files.
Java: Apache POI http://poi.apache.org/spreadsheet/
C++: http://sourceforge.net/projects/xlslib/?source=directory
...
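If you go the programmatic route in Java, a rough sketch with Apache POI could look like the following. It assumes the image/BLOB column has already been read into a byte array (for example with JDBC getBytes); the sheet index and column number are placeholders.

    import java.io.ByteArrayInputStream;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;

    public class XlsxBlobReader {

        // blobBytes: the raw bytes read from the image/varbinary column.
        public static List<String> readColumn4(byte[] blobBytes) throws Exception {
            List<String> values = new ArrayList<>();
            DataFormatter formatter = new DataFormatter(); // renders cells as displayed text
            try (Workbook wb = new XSSFWorkbook(new ByteArrayInputStream(blobBytes))) {
                Sheet sheet = wb.getSheetAt(0);            // first sheet; adjust as needed
                for (Row row : sheet) {
                    Cell cell = row.getCell(3);            // column 4 is zero-based index 3
                    if (cell != null) {
                        values.add(formatter.formatCellValue(cell));
                    }
                }
            }
            return values;
        }
    }

The returned values could then be written to the target table with an ordinary batched JDBC insert.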

SSIS using too much memory to load large (40GB+) XML file into SQL Server table

I need to load a single large (40GB+) XML file into a SQL Server 2012 database table using SSIS. I'm having problems because SSIS seems to be trying to load the entire document into memory instead of streaming it.
Here are more details of my SSIS package.
I've created an XML Source with the following properties:
Data access mode: XML file from variable (but it could be XML File Location)
Variable name: the variable that specifies the XML file path on my computer
XSD location: the path to the XSD that defines the XML being read
The XML structure is simple, with only 3 hierarchical levels:
Root element with header information
One level defining collections of objects
The leaf level defining individual objects (each with a fixed set of fields)
I need to insert one database record per leaf element, repeating the fields from the higher hierarchy levels. In other words, I need to flatten the XML hierarchy.
How can I make SSIS stream the data while loading, instead of trying to load the entire document into memory?
The XML Source always loads the entire file. It uses XmlDocument to do so (last time I checked).
The only thing you can do is split up the file somehow, then iteratively run each piece through your data flow.
Beyond that, you're looking at creating a custom data source, which is not trivial. It also represents a serious piece of code to maintain.
There may be third-party data sources which can do this. I had to write my own about five years ago.
Have you considered processing the file in smaller chunks?
I had the same issue before, so I created a script component to split that one big XML file into hundreds of smaller XML files, then used a ForEach loop to iterate over all of the smaller XML files and process them.
To do this you can't use StreamReader.ReadLine, because it will still do the same thing and load that very large file; instead, use System.IO.MemoryMappedFiles, a class designed for this scenario.
Have a look here: http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx
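Both answers describe fixes on the SSIS/.NET side. Purely as an illustration of the streaming idea itself (emit one flattened row per leaf element without ever holding the whole document in memory), here is a hedged Java StAX sketch; the element and attribute names ("collection", "object", "name", "id") are made up and would have to match your XSD.

    import java.io.FileInputStream;

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StreamingXmlFlattener {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            try (FileInputStream in = new FileInputStream(args[0])) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                String collectionName = null;              // repeated onto every leaf row
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String element = reader.getLocalName();
                        if ("collection".equals(element)) {
                            collectionName = reader.getAttributeValue(null, "name");
                        } else if ("object".equals(element)) {
                            // One flattened record per leaf element; in a real load this
                            // would feed a batched INSERT instead of standard output.
                            String id = reader.getAttributeValue(null, "id");
                            System.out.println(collectionName + "," + id);
                        }
                    }
                }
                reader.close();
            }
        }
    }

The memory footprint stays roughly constant regardless of file size, which is exactly what the built-in XML Source does not give you.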
