How to I relate the metrics in Datadog with execution plan operators in Flink? - apache-flink

In my case scenario, Flink is sending the metrics to Datadog. Datadog Host map is as shown below { I have no Idea why is showing me latency here }
Flink metrics are sent to localhost. The issue here is that when
flink-conf.yaml file configuration is as follows
# adding metrics
metrics.reporters: stsd , dghttp
metrics.reporter.stsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.stsd.host: localhost
metrics.reporter.stsd.port: 8125
# for datadog
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: xxx
metrics.reporter.dghttp.tags: host:localhost, job_id : jobA , tm_id : task1 , operator_name : operator1
metrics.scope.operator: numRecordsIn
metrics.scope.operator : numRecordsInPerSecond
metrics.scope.operator : numRecordsOut
metrics.scope.operator : numRecordsOutPerSecond
metrics.scope.operator : latency
The issue is that Datadog is showing 163 metrics which I don't understand, which I will explain in a while
I don't understand the metrics format in datadog as it shows me metrics something like this
Now as shown in above Image
Latency is expressed in time
Number of events per second is event /sec
count is some value
So my question is that which metric is this?
Also, the execution plan of my job is something like this
How do I relate the metrics in Datadog with execution plan operators in Flink?
I have read in Flink API 1.3.2 that I can use tags, I have tried to use them in flink-conf.yaml file but I don't have complete Idea what sense they make here.
My ultimate goal is to find operator latency, number of records out and in /second at each operator in this case

There are a variety of issues here.
1. You've misconfigured the scope formats. (metrics.scope.operator)
For one the configuration doesn't make sense since you specify "metrics.scope.operator" multiple times; only the last config entry is honored.
Second, and more importantly, you have misunderstood for scope formats are used for.
Scope formats configure which context information (like the ID of the task) is included in the reported metric's name.
By setting it to a constant ("latency") you've told Flink to not include anything. As a result, the numRecordsIn metrics for every operator is reported as "latency.numRecordsIn".
I suggest to just remove your scope configuration.
2. You've misconfigured the Datadog Tags
I do not understand what you were trying to do with your tags configuration.
The tags configuration option can only be used to provide global tags, i.e. tags that are attached to every single metrics, like "Flink".
By default every metric that the Datadog reports has tags attached to it for every available scope variable available.
So, if you have an operator name A, then the numRecordsIn metric will be reported with a tag "operator_name:A".
Again, I would suggest to just remove your configuration.

Related

How to Implement Patterns to Match Brute Force Login and Port Scanning Attacks using Flink CEP

I have a use case where a large no of logs will be consumed to the apache flink CEP. My use case is to find the brute force attack and port scanning attack. The challenge here is that while in ordinary CEP we compare the value against a constant like "event" = login. In this case the Criteria is different as in the case of brute force attack we have the criteria as follows.
username is constant and event="login failure" (Delimiter the event happens 5 times within 5 minutes).
It means the logs with the login failure event is received for the same username 5 times within 5 minutes
And for port Scanning we have the following criteira.
ip address is constant and dest port is variable (Delimiter is the event happens 10 times within 1 minute). It means the logs with constant ip address is received for the 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one ip address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is the roughly same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's github account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table, and only generate alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
.<Event>begin("distinctPorts")
.where(iterative condition 1)
.oneOrMore()
.followedBy("end")
.where(iterative condition 2)
.within(1 minute)
The first iterative condition returns true if the event being added to the pattern has a distinct port from all of the previously matching events. Somewhat similar to the example here, in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.

DataflowTemplates debezium connector issue

I am referring this document to perform mysql(installed on local machine) to pubsub data streaming using debezium connector.
My properties file looks like below
databaseName=testdb
databaseUsername=root
databaseAddress=localhost
databasePort=3306
gcpProject=GCP_project_name
databasePassword=password
whitelistedTables=instance-name.testdb.testtab
singleTopicMode=true
gcpPubsubTopicPrefix=debeziumTest
databaseManagementSystem=mysql
I have already created topic in pubsub with name "debeziumTest".
But the issue is, when i run
sudo mvn exec:java -pl cdc-embedded-connector -Dexec.args="/path/to/properties-file"
, it runs without any error:
but there is no data uploaded to pubsub.
Based on the documentation, table updates are pushed to a topic that matches this format- ${PREFIX}${DB_INSTANCE}.${DATABASE}.${TABLE}
In your case I believe you should create a topic with the name - "debeziumTestinstance-name.testdb.testtab"
This may not be the only problem based on the warnings I see in the logs you shared.
The problem seems to be with your whitelistedTables.
According to the docuumentation, you should use ${instance}.${database}.${table}, for your given example it should be whitelistedTables=testdb.databaseName.testTab (if testTab is your tablename)

Solr 3.5 indexing taking very long

We recently migrated from solr3.1 to solr3.5, we have one master and one slave configured. The master has two cores,
1) Core1 – 44555972 documents
2) Core2 – 29419244 documents
We commit every 5000 documents, but lately the commit is taking very long 15 minutes plus in some cases. What could have caused this, I have checked the logs and the only warning i can see is,
“WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.”
Memory details:
export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
Also noticed, that top command show almost 350GB of Virtual memory usage.
What could be causing this, as everything was running fine a few days back?
Do you have a large search warming query? Our commits take upto 2 mins because of search warming in place. Wondering if that is the case.
The large virtual memory usage would explain this.

How to aggregate files in Mule ESB CE

I need to aggregate a number of csv inbound files in-memory, if necessary resequencing them, on Mule ESB CE 3.2.1.
How could I implement this kind of logics?
I tried with message-chunking-aggregator-router, but it fails on startup because xsd schema does not admit such a configuration:
<message-chunking-aggregator-router timeout="20000" failOnTimeout="false" >
<expression-message-info-mapping correlationIdExpression="#[header:correlation]"/>
</message-chunking-aggregator-router>
I've also tried to attach mine correlation ids to inbound messages, then process them by a custom-aggregator, but I've found that Mule internally uses a key made up of:
Serializable key=event.getId()+event.getMessage().getCorrelationSequence();//EventGroup:264
The internal id is everytime different (also if correlation sequence is correct): this way, Mule does not use only correlation sequence as I expected and same message is processed many times.
Finally, I can re-write a custom aggregator, but I would like to use a more consolidated technique.
Thanks in advance,
Gabriele
UPDATE
I've tried with message-chunk-aggregator but it doesn't fit my requisite, as it admits duplicates.
I try to detail the scenario I need to cover:
Mule polls (on a SFTP location)
file 1 "FIXEDPREFIX_1_of_2.zip" is detected and kept in memory somewhere (as an open SFTPStream, it's ok).
Some correlation info are mantained for grouping: group, sequence, group size.
file 1 "FIXEDPREFIX_1_of_2.zip" is detected again, but cannot be inserted because would be duplicated
file 2 "FIXEDPREFIX_2_of_2.zip" is detected, and correctly added
stated that group size has been reached, Mule routes MessageCollection with the correct set of messages
About point 2., I'm lucky enough to get info from filename and put them into MuleMessage::correlation* properties, so that subsequent components could use them.
I did, but duplicates are processed the same.
Thanks again
Gabriele
Here is the right router to use with Mule 3: http://www.mulesoft.org/documentation/display/MULE3USER/Routing+Message+Processors#RoutingMessageProcessors-MessageChunkAggregator

Designing a near real time streaming backend

I have the following requirements for designing a streaming backend :
Documents are getting added # 20 docs/sec. Each doc has a timestamp field.
Searches are primarily based on timestamp range ( e.g. show me documents arrived in last 20 minutes )
Search QueriesPerSecond : 100 searches/sec
Documents older than 2 days could be continuously deleted for optimization purposes ( by a cron )
I am thinking of using Solr ( with SolrReplication/NRT ). The problem with Solr is basically frequent updates/deletes. For freshest data I will need to do commit on each update ( otherwise data wont be visible by searchers). Setting pollInterval~1 minute might kill the master/server both. NRT/SolrCloud could be one fo the options, but not very sure about their stability.
Any other approaches/suggestions based on SQL/NoSQL architectures ?
mysql + memcached. Facebook runs their entire site on these two widely-available, widely supported, free and open source packages.

Resources