How to aggregate files in Mule ESB CE

I need to aggregate a number of inbound CSV files in memory, resequencing them if necessary, on Mule ESB CE 3.2.1.
How could I implement this kind of logic?
I tried the message-chunking-aggregator-router, but it fails on startup because the XSD schema does not allow such a configuration:
<message-chunking-aggregator-router timeout="20000" failOnTimeout="false">
    <expression-message-info-mapping correlationIdExpression="#[header:correlation]"/>
</message-chunking-aggregator-router>
I've also tried attaching my own correlation IDs to inbound messages and then processing them with a custom-aggregator, but I found that Mule internally uses a key made up of:
Serializable key=event.getId()+event.getMessage().getCorrelationSequence();//EventGroup:264
The internal ID is different every time (even when the correlation sequence is correct), so Mule does not key on the correlation sequence alone as I expected, and the same message is processed many times.
Finally, I could rewrite a custom aggregator, but I would prefer a more established technique.
Thanks in advance,
Gabriele
UPDATE
I've tried the message-chunk-aggregator, but it doesn't meet my requirement because it allows duplicates.
Let me detail the scenario I need to cover:
1. Mule polls (on an SFTP location).
2. File 1 "FIXEDPREFIX_1_of_2.zip" is detected and kept in memory somewhere (as an open SFTPStream, that's fine). Some correlation info is maintained for grouping: group, sequence, group size.
3. File 1 "FIXEDPREFIX_1_of_2.zip" is detected again, but cannot be inserted because it would be a duplicate.
4. File 2 "FIXEDPREFIX_2_of_2.zip" is detected and correctly added.
5. Once the group size has been reached, Mule routes a MessageCollection with the correct set of messages.
About point 2: I'm lucky enough to be able to extract that info from the filename and put it into the MuleMessage correlation* properties, so that subsequent components can use it.
I did that, but duplicates are processed all the same.
Thanks again
Gabriele

Here is the right router to use with Mule 3: http://www.mulesoft.org/documentation/display/MULE3USER/Routing+Message+Processors#RoutingMessageProcessors-MessageChunkAggregator


How to Implement Patterns to Match Brute Force Login and Port Scanning Attacks using Flink CEP

I have a use case where a large number of logs will be consumed by Apache Flink CEP. My use case is to detect brute force attacks and port scanning attacks. The challenge here is that in ordinary CEP we compare a value against a constant, like "event" = login. In this case the criteria are different; for a brute force attack the criteria are as follows.
Username is constant and event = "login failure" (the event happens 5 times within 5 minutes).
It means that logs with the login failure event are received for the same username 5 times within 5 minutes.
And for port scanning we have the following criteria.
The IP address is constant and the destination port is variable (the event happens 10 times within 1 minute). It means that logs with a constant IP address are received for 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one IP address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is roughly the same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
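For example, a minimal sketch of keying the stream before applying CEP might look like this (the Event class, its username field, the LogSource and the loginFailurePattern are placeholders for illustration, not something from the question):

// Partition the stream so the pattern is evaluated independently per username.
// Event, LogSource and loginFailurePattern are hypothetical placeholders.
DataStream<Event> events = env.addSource(new LogSource());
KeyedStream<Event, String> byUser = events.keyBy(event -> event.username);
// CEP then matches the pattern separately for each distinct username.
PatternStream<Event> matches = CEP.pattern(byUser, loginFailurePattern);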
However, rather than CEP, I would recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's GitHub account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
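For instance, with the Table API in recent Flink versions you can bound how long idle per-key state is retained (the 24-hour value here is just an illustrative choice):

// Drop state for keys that have been idle for more than 24 hours (illustrative value).
tableEnv.getConfig().setIdleStateRetention(Duration.ofHours(24));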
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table and only generating alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
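A rough sketch of that kind of time-bounded suppression with a KeyedProcessFunction (the Alert type, keying by IP, and the one-hour window are assumptions for illustration):

// Emits an alert only the first time it is seen for a given key (IP), then
// suppresses further alerts for that key for one hour (illustrative interval).
public static class AlertDeduplicator extends KeyedProcessFunction<String, Alert, Alert> {

    private transient ValueState<Boolean> alerted;

    @Override
    public void open(Configuration parameters) {
        alerted = getRuntimeContext().getState(
                new ValueStateDescriptor<>("alerted", Boolean.class));
    }

    @Override
    public void processElement(Alert alert, Context ctx, Collector<Alert> out) throws Exception {
        if (alerted.value() == null) {
            out.collect(alert);
            alerted.update(true);
            // Schedule a timer to re-enable alerts for this IP after one hour.
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 60 * 60 * 1000L);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) throws Exception {
        alerted.clear();
    }
}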
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
    .<Event>begin("distinctPorts")
    .where(/* iterative condition 1 */)
    .oneOrMore()
    .followedBy("end")
    .where(/* iterative condition 2 */)
    .within(Time.minutes(1));
The first iterative condition returns true if the event being added to the pattern has a port distinct from those of all previously matched events. This is somewhat similar to the example in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
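A rough sketch of those two conditions as IterativeConditions (assuming a hypothetical Event type with an int port field):

// Condition 1: accept the event only if its port differs from every event
// already matched into "distinctPorts".
IterativeCondition<Event> hasNewPort = new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event event, Context<Event> ctx) throws Exception {
        for (Event previous : ctx.getEventsForPattern("distinctPorts")) {
            if (previous.port == event.port) {
                return false;
            }
        }
        return true;
    }
};

// Condition 2: accept only if at least 9 distinct ports have been matched
// already and this event brings yet another new port (10 in total).
IterativeCondition<Event> isTenthDistinctPort = new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event event, Context<Event> ctx) throws Exception {
        int matchedSoFar = 0;
        for (Event previous : ctx.getEventsForPattern("distinctPorts")) {
            if (previous.port == event.port) {
                return false;
            }
            matchedSoFar++;
        }
        return matchedSoFar >= 9;
    }
};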
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.

Debezium error, schema isn't known to this connector

I have a project using Debezium, mostly based on this example, which is then connected to Apache Pulsar.
I have changed a few configurations. The file now looks like this:
database.history=io.debezium.relational.history.MemoryDatabaseHistory
connector.class=io.debezium.connector.mysql.MySqlConnector
offset.storage=org.apache.kafka.connect.storage.FileOffsetBackingStore
offset.storage.file.filename=offset.dat
offset.flush.interval.ms=5000
name=mysql-dbz-connector
database.hostname={ip}
database.port=3308
database.user={user}
database.password={pass}
database.dbname=database
database.server.name=test
table.whitelist=database.history_table,database.project_table
snapshot.mode=schema_only
schemas.enable=false
include.schema.changes=false
pulsar.topic=persistent://public/default/{0}
pulsar.broker.address=pulsar://{ip}:6650
database.history=io.debezium.relational.history.MemoryDatabaseHistory
As you may understand, what I'm trying to do is monitor modifications to history_table and project_table in the database and then write the payloads to Apache Pulsar.
My problem is as follows: whatever snapshot mode I use, once an offset has been written, I can't restart Debezium without getting an error on the next database update.
Encountered change event for table database.history_table whose schema isn't known to this connector
It only happens with an existing offset.dat file. I think this is because the schema is null within the offset.dat file. Take this one for example:
¨Ìsrjava.util.HashMap⁄¡√`—F
loadFactorI thresholdxp?#wur[B¨Û¯T‡xpG{"schema":null,"payload":["mysql-dbz-connector",{"server":"test"}]}uq~U{"ts_sec":1563802215,"file":"database-bin.000005","pos":79574,"server_id":1,"event":1}x
I first suspected the schemas.enable=false or the include.schema.changes=false parameters that I used to make the JSON more concise, but their values don't change anything in the offset.dat file.
The problem lies in the line database.history=io.debezium.relational.history.MemoryDatabaseHistory. The history will not survive a restart. You should use FileDatabaseHistory instead.
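For example, something along these lines in the connector properties should keep the history across restarts (the file name here is just an illustrative choice):

database.history=io.debezium.relational.history.FileDatabaseHistory
database.history.file.filename=dbhistory.dat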

SilverPOP API - Automated message report

I need to export all data from a Silverpop automated message using the Silverpop API.
Apparently there is not much information on the net apart from the official guide, "XML API Developer's Guide ENGAGE".
I need to know how to:
retrieve a list of Automated Messages
extract/download the data of a selected report (for all days, not a single one)
finally (again not documented in the official guide): programmatically export the final report having set MOVE_TO_FTP=true
(the guide says "Use the MOVE_TO_FTP parameter to retrieve the output file programmatically")
Thank you very much in advance for any help with this.
You can use the RawRecipientDataExport XML API export to get the following:
• One or more mailings
• One or more Mailing/Report ID combinations (for Autoresponders)
• A specific Database (optional: include related queries)
• A specific Group of Automated Messages
• An Event Date Range
• A Mailing Date Range
For automated messages, you can use the following XML tags <CAMPAIGN_ACTIVE/>, <CAMPAIGN_COMPLETED/>, and <CAMPAIGN_CANCELLED/> to retrieve active Groups of Automated Messages, retrieve completed Groups of Automated Messages, and retrieve canceled Groups of Automated Messages, respectively.
To get data for all days and not just one, you can set a date range for send dates and event dates, by putting your desired date ranges within the <SEND_DATE_START> and <SEND_DATE_END> tags and <EVENT_DATE_START> and <EVENT_DATE_END> tags. The date formats are like this: 12/02/2011 23:59:00
Hope this helps.

Performance problems and optimizations for csv file to SQL Routes

Pretty new to Camel, but good fun so far.
I am using it to process CSV files that are FTP'd to a particular directory, parsing them and then placing the data in the relevant MySQL tables (all on the same host).
I have a simple CamelContext containing 4 routes, each of which processes a different file type and places the parsed data in its associated table. A number of these files can appear in the FTP directory every 15 minutes (on the clock: 15, 30, 45, 00 minutes).
The Camel context registry has only one entry, that of the dataSource (a simple MySQL data source provided by org.mariadb.jdbc.MySQLDataSource; we are migrating over to MariaDB).
Each of the 4 routes follows the pattern below, differing only in the "from file" name pattern and the choice of Processor:
from("file:" + FilePath + "?include=(.*)dmm_all(.*)csv.gz&" + fileAction + "&moveFailed=" + FaultyFilePath)
.routeId("myRoute")
.unmarshal().gzip()
.unmarshal().csv()
.split().body()
.process(myProcessor)
.choice()
.when(simple("${header.myResult} != 'header removed'"))
.to("sql:" + insertQuery.trim() + "?dataSource=sqlds")
.end();
Each Processor just removes the file header (column names) and sets the body for insert into the associated MySql table with the given insert query.
For incoming files it works like a charm.
But when the application/Camel process is not up, the FTP directory can fill up with a lot of files.
Starting the application in this situation is when the performance problems arise.
I get the following exception(s) once a certain number of files is exceeded:
Cause: org.springframework.jdbc.CannotGetJdbcConnectionException:
Could not get JDBC Connection;
nested exception is java.sql.SQLNonTransientConnectionException:
Could not connect to localhost:3306 : Cannot assign requested address
I have tried adding a delay on each route, and it goes smoothly, but painfully slow.
I am guessing (wildly, I might add) that there is contention between the route threads (if that is how Camel works, with a thread per route?) and the single DB connection of the dataSource.
Is this the case, and if so, what's the best strategy to solve it (a connection pool? a single shared thread for all the routes? or...)?
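For what it's worth, a minimal sketch of registering a pooled DataSource under the same "sqlds" name used by the routes might look like this (using commons-dbcp2 and a Camel 2.x SimpleRegistry purely as an illustration; the URL, credentials and pool size are placeholders):

// Pooled DataSource instead of a single shared connection (illustrative settings).
BasicDataSource ds = new BasicDataSource();
ds.setUrl("jdbc:mariadb://localhost:3306/mydb");
ds.setUsername("user");
ds.setPassword("secret");
ds.setMaxTotal(8); // cap the number of concurrent connections

// Register it under the name the routes already reference (?dataSource=sqlds).
SimpleRegistry registry = new SimpleRegistry();
registry.put("sqlds", ds);
CamelContext context = new DefaultCamelContext(registry);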
What other strategies are there to improve performance in the Camel scenario I use above?
Very grateful for any input and feedback.
cheers.

Apache Camel - Make Aggregator 'flush'

I effectively want a flush, or a completionSize but for all the aggregations in the aggregator. Like a global completionSize.
Basically I want to make sure that every message that comes in a batch is aggregated and then have all the aggregations in that aggregator complete at once when the last one has been read.
e.g. 1000 messages arrive (the length is not known beforehand)
aggregate on correlation ID into bins:
A: 300
B: 400
C: 300 (the size of the bins is not known beforehand)
I want the aggregator not to complete until the 1000th exchange is aggregated
thereupon I want all of the aggregations in the aggregator to complete at once
The completionSize applies to each aggregation, and not the aggregator as a whole, unfortunately. So if I set completionSize(1000) it will just never finish, since each aggregation would have to reach 1000 before it is 'complete'.
I could get around it by building up a single Map object, but that kind of sidesteps the correlation in aggregator2, which I would ideally prefer to use.
So yeah, either a global completion size or flushing: is there a way to do this intelligently?
One option is to simply add some logic that keeps a global counter and sets the Exchange.AGGREGATION_COMPLETE_ALL_GROUPS header once it's reached...
Available as of Camel 2.9...You can manually complete all current aggregated exchanges by sending in a message containing the header Exchange.AGGREGATION_COMPLETE_ALL_GROUPS set to true. The message is considered a signal message only, the message headers/contents will not be processed otherwise.
I suggest taking a look at the Camel aggregator EIP docs at http://camel.apache.org/aggregator2 and reading about the different completion conditions, as well as the special message Ben refers to, which you can send to signal completion of all in-flight aggregates.
If you consume from a batch consumer (http://camel.apache.org/batch-consumer.html), then you can use a special completion that completes when the batch is done, for example if you pick up files or rows from a JPA database table. When all messages from the batch consumer have been processed, the aggregator can signal completion for all these aggregated messages, using the completionFromBatchConsumer option.
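A minimal sketch of that option in the Java DSL might look like this (the endpoint, correlation key and aggregation strategy are placeholders):

from("file:inbox") // a batch consumer: each poll is one batch
    .aggregate(header("myCorrelationKey"), new GroupedExchangeAggregationStrategy())
        // complete every in-flight group once the whole batch has been processed
        .completionFromBatchConsumer()
    .to("direct:processGroup");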
Also, if you have a copy of the Camel in Action book, read chapter 8, section 8.2, which covers the aggregate EIP in much more detail.
Using Exchange.AGGREGATION_COMPLETE_ALL_GROUPS_INCLUSIVE worked for me:
from(endpoint)
    .unmarshal(csvFormat)
    .split(body())
        .bean(CsvProcessor())
        .choice()
            // If all messages are processed, flush the aggregation
            .`when`(simple("\${property.CamelSplitComplete}"))
                .setHeader(Exchange.AGGREGATION_COMPLETE_ALL_GROUPS_INCLUSIVE, constant(true))
        .end()
    .aggregate(simple("\${body.trackingKey}"),
            AggregationStrategies.bean(OrderAggregationStrategy()))
        .completionTimeout(10000)
