Camel File Split and Aggregate

I am working on processing a large CSV file and I have used split to break it up. Here is a sample of the file format:
Item1,Item2,Item3,Item4
Item1,Item5,Item7,Item2
Here is my route information:
<route>
    <from uri="file://Data/groupedDocs?preMove=staging&amp;delete=false" />
    <split streaming="true" parallelProcessing="true">
        <tokenize token="\n" group="1" />
        <to uri="bean:groupProcessor" />
    </split>
    <log message="File Sent!!!"/>
</route>
In the above route, my groupProcessor processes each individual line from the CSV file.
The issue is: how will I know when all the records have been processed? There could be 10 or 100 records. I looked at the aggregator pattern, but the problem is that I do not want to aggregate, i.e. I am not reading all the records and dumping them into one file; I am creating a new file for each line of the CSV. It is also possible that some lines in the CSV generate an error, and for those entries I do not create a file. E.g. if the CSV has 10 lines and 2 of them throw an exception, I have to log those 2 as exceptions and generate 8 files for the remaining entries. At the end, I also need counts of the files generated and the lines that errored out. Can anyone please help here?

You can set a flag in a header whenever an exception occurs, and then count those flags in an AggregationStrategy (see the section "Split aggregate request/reply sample" in the Camel Splitter documentation). In the XML DSL, the strategy is referenced via the strategyRef attribute on the split.
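For illustration, here is a minimal sketch of such a strategy (class, package and header names are hypothetical; the import shown is the Camel 2.x location, in Camel 3+ the interface is org.apache.camel.AggregationStrategy). It assumes the route marks a failed line by setting a failed header, e.g. from a doTry/doCatch around the bean call:

import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class CountingStrategy implements AggregationStrategy {
    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        // The first exchange becomes the accumulator for the whole split
        Exchange result = oldExchange == null ? newExchange : oldExchange;
        // "failed" is the hypothetical header set by the route on error
        boolean failed = Boolean.TRUE.equals(newExchange.getIn().getHeader("failed", Boolean.class));
        int ok = result.getIn().getHeader("okCount", 0, Integer.class);
        int errors = result.getIn().getHeader("errorCount", 0, Integer.class);
        result.getIn().setHeader("okCount", failed ? ok : ok + 1);
        result.getIn().setHeader("errorCount", failed ? errors + 1 : errors);
        return result;
    }
}

Wired into the route via strategyRef (bean id hypothetical), the aggregated exchange, and therefore the counters, is what continues after the split:

<bean id="countingStrategy" class="com.example.CountingStrategy"/>

<split streaming="true" parallelProcessing="true" strategyRef="countingStrategy">
    <tokenize token="\n" group="1" />
    <to uri="bean:groupProcessor" />
</split>
<log message="Done: ${header.okCount} files created, ${header.errorCount} lines failed"/>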


Passing a List to an sql-component query

I'm having trouble passing a list of strings, which I get back from my bean, to my sql-component query in order to make a call to the database.
<bean ref="fo" method="transformTo(${body})" />
In the line above, I take the XML data from the body and transform it to JSON.
<bean ref="fot" method="getOTs(${body})" />
Then I extract the part I want from the JSON and return a list of strings (method signature):
public List<String> getOTs(String jsonOTs)
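(For context, a minimal sketch of what such a method might look like, assuming Jackson on the classpath and a hypothetical payload shape like {"ots": ["OT1", "OT2"]}:)

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

public class Fot {
    private final ObjectMapper mapper = new ObjectMapper();

    public List<String> getOTs(String jsonOTs) throws Exception {
        List<String> ots = new ArrayList<>();
        // "ots" is a hypothetical field name; adapt it to the actual JSON
        for (JsonNode node : mapper.readTree(jsonOTs).path("ots")) {
            ots.add(node.asText());
        }
        return ots;
    }
}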
Now the part that isn't working (I get an error saying that one parameter is expected but several are supplied each time):
<to uri="sql:insert into dbo.table_example (OT) VALUES :#body;"/>
My goal is quite simple: retrieve a list of strings from my bean (working) and build an INSERT INTO query from it. I have only one parameter but multiple values. Example:
INSERT INTO table_name (column_list)
VALUES
(value_list_1),
(value_list_2),
...
(value_list_n);
Bulk insert
For a bulk insert, you need to set the query parameter batch to true; this way, Camel understands that you want to insert several rows in one batch.
Here is the corresponding to endpoint in your case:
<to uri="sql:insert into dbo.table_example (OT) VALUES (#)?batch=true"/>
Miscellaneous remarks
Actually, for all the use cases you listed above, there is no need to refer to the body explicitly.
Indeed, in the case of a bean, you can specify only the method to invoke; Camel injects the body as a parameter of your method and automatically converts it to the expected type, which is String in your case.
Refer to https://camel.apache.org/manual/bean-binding.html#_parameter_binding for more details.
Regarding the SQL producer, assuming you did not change the default configuration, the proper way is to use the placeholder, which is # by default; Camel then automatically uses the content of the body as the parameters of the underlying PreparedStatement.
So you should retry with:
<to uri="sql:insert into dbo.table_example (OT) VALUES (#)"/>
If you really want to refer to the body explicitly in your query, use :#${body} instead, as follows:
<to uri="sql:insert into dbo.table_example (OT) VALUES (:#${body})"/>
Misuse of named parameter
If you use only #body, as you did, Camel interprets it as a named parameter: it tries to get the value from the body if the body is a map (by looking up the key body), and otherwise from a header named body. In your case no such value exists, so you end up with an error of this type:
Cannot find key [body] in message body or headers to use when setting named
parameter in query [insert into developers (name) values :?body;] on the exchange

SSIS Ignore Blank Lines

I get the following SSIS error message when my source file has blank lines at the end. I don't care about the blank lines, as they don't affect the overall goal of pumping data from a text file into a database table. I'd like to ignore this message or, if it's easier, configure SSIS to ignore blanks.
<DTS:Column DTS:ID="96" DTS:IdentificationString="Flat File Source.Outputs[Flat File Source Error Output].Columns[Flat File Source Error Output Column]"/>
I found a similar question below, but the solution isn't an SSIS one; it's one that preprocesses the text files, which would be my least favorite solution.
SSIS Import Multiple Files Ignore blank lines
If you want to exclude records with blank values, you can use a Conditional Split. Add it between your source file and your destination.
The expression can be like the one below:
ISNULL(Col1) && ISNULL(Col2) && ISNULL(Col3) ...
Name the output Remove Blank Lines. When connecting your Conditional Split to your destination, SSIS will ask which output of the split component should be returned. In this case, choose the Conditional Split Default Output to get all the records without blank values.
You can enable a Data Viewer before and after the Conditional Split to see the filtered output.

How does one split a large result set from a Group By into multiple flat files?

I'm far from an SSIS expert and I'm attempting to correct an error (unspecified in the messages) that began once I modified a variable to increase the amount of data accumulated and exported into a flat file. (Note: the variable was a date in the WHERE clause that limited the data returned by the SELECT.)
So in the data flow there's a GROUP BY component, and I'm trying to find the appropriate component to put between it and the flat file destination component to chop up the results. I figured there'd be something to export, say, flatFile1.csv, flatFile2.csv, etc., based on a number of lines (so if I set a limit of 1 million lines and the results returned 3.5 million, I'd get 4 files, with the last one containing half a million lines), or perhaps a maximum file size with similar results.
Which component should I use from the toolbox to guarantee a manageable file size?
Is a script component the only way to handle output of any size? If so, would it sit between the Group By and the Flat File output components, or would the script completely obviate the need for the Flat File output?

SSIS error handling: redirect rows from a flat file whose zip code field is longer than 5 characters

I have been given a task to load a simple flat file into another using an SSIS package. The source flat file contains a zip code field; my task is to extract and load into another flat file only the rows with a correct zip code, which is a 5-digit zip code, and redirect the invalid rows to a new file.
Since I am new to SSIS, any help or ideas are much appreciated.
You can add a Derived Column that determines the length of the field, and then add a Conditional Split based on that column: <= 5 goes down the good path, > 5 goes down the reject path.
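For example, assuming the incoming column is named ZipCode (both names here are hypothetical), the Derived Column could add a ZipLength column with the expression:
LEN(ZipCode)
and the Conditional Split condition for the good path would then be:
ZipLength <= 5
with the default output (lengths greater than 5) connected to the reject file.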

Conditional ETL in Camel based on matching .md5

I looked through the docs for a way to use Camel for ETL just as in the site's examples, except with these additional conditionals based on an md5 match.
Like the Camel example, myetl/myinputdir would be monitored for any new file, and if one is found, the file ${filename} would be processed.
Except it would first wait for ${filename}.md5 to show up, which would contain the correct md5. If ${filename}.md5 never showed up, it would simply ignore the file until it did.
And if ${filename}.md5 did show up but the md5 didn't match, it would be processed but with an error condition.
I found suggestions to use crypto for matching, but I have not figured out how to ignore the file until the matching .md5 file shows up. Really, these two files need to be processed as a matched pair for everything to work properly, and they may not arrive in the input directory at the exact same millisecond. Or alternatively, the md5 file might show up a few milliseconds before the data file.
You could use an aggregator to combine the two files based on their file name. If your files are suitably named, you can use the file name (without extension) as the correlation ID and continue the route once completionSize equals 2. If you set groupExchanges to true, then in the next route step you have access both to the file whose hash you need to compute and to the contents of the md5 file to compare that hash against. And if the md5 or content file never arrives within completionTimeout, you can trigger whatever action is appropriate for your scenario.
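A minimal sketch of that route in the XML DSL (endpoint, bean name and timeout are placeholders; it assumes pairs named like data.csv and data.md5 so that the file name without extension matches; a scheme like data.csv.md5 would need a slightly smarter correlation expression):

<route>
    <from uri="file://myetl/myinputdir"/>
    <aggregate groupExchanges="true" completionSize="2" completionTimeout="60000">
        <correlationExpression>
            <simple>${file:name.noext}</simple>
        </correlationExpression>
        <!-- the body here is a List of the two grouped exchanges (data file + .md5);
             on completionTimeout the group is emitted with only one exchange,
             which the verifier can treat as the missing-pair case -->
        <to uri="bean:md5PairVerifier"/>
    </aggregate>
</route>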
