Flink: Key Group 91 does not belong to the local range - apache-flink

As the title says, the following exception occurs in keyed windows:
java.lang.IllegalArgumentException: Key Group 91 does not belong to the local range.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:139)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.getIndexForKeyGroup(HeapInternalTimerService.java:431)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.getProcessingTimeTimerSetForKeyGroup(HeapInternalTimerService.java:412)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.getProcessingTimeTimerSetForTimer(HeapInternalTimerService.java:402)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.registerProcessingTimeTimer(HeapInternalTimerService.java:194)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.registerProcessingTimeTimer(WindowOperator.java:907)
at org.apache.flink.streaming.api.windowing.triggers.ProcessingTimeTrigger.onElement(ProcessingTimeTrigger.java:36)
at org.apache.flink.streaming.api.windowing.triggers.ProcessingTimeTrigger.onElement(ProcessingTimeTrigger.java:28)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onElement(WindowOperator.java:926)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(WindowOperator.java:393)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:207)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:745)
The code is:
stream.keyBy(...).timeWindow(Time.minutes(5)).apply(...)
The keyBy implementation returns a String result. Does anyone have an idea about this? I have looked at the code in HeapInternalTimerService, but in what case would the keyGroupId fall outside the local range?

I see two possibilities that could lead to this error:
1. Your key extractor function is not deterministic, i.e., it might return different values for the same record.
2. There is a bug in Flink.
Please check that 1. is not the case. If you are sure that the key extractor is not the problem, please reach out to the Flink user mailing list or create a Jira issue.
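To illustrate the first point, here is a minimal sketch of a deterministic versus a non-deterministic key selector (Event, getUserId(), and the selector names are hypothetical, not taken from the original job):
import org.apache.flink.api.java.functions.KeySelector;

// Deterministic: the key is derived only from the record itself, so the
// partitioning step and the window operator agree on the key group.
KeySelector<Event, String> deterministicKey = event -> event.getUserId();

// NOT deterministic: the key changes between invocations for the same record,
// which can put a timer's key group outside the operator's local range and
// produce exactly this IllegalArgumentException.
KeySelector<Event, String> nonDeterministicKey =
        event -> event.getUserId() + "-" + System.nanoTime();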

Related

How to Implement Patterns to Match Brute Force Login and Port Scanning Attacks using Flink CEP

I have a use case where a large number of logs will be consumed by Apache Flink CEP. My use case is to detect brute force attacks and port scanning attacks. The challenge here is that in ordinary CEP we compare a value against a constant, like event = "login", whereas in this case the criteria are different. For a brute force attack, the criteria are as follows:
username is constant and event = "login failure" (the constraint is that the event happens 5 times within 5 minutes).
That is, logs with the login failure event are received for the same username 5 times within 5 minutes.
And for port scanning we have the following criteria:
ip address is constant and dest port is variable (the constraint is that the event happens 10 times within 1 minute). That is, logs with a constant ip address are received for 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one IP address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is roughly the same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
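As a rough sketch of what that looks like (LogEvent, its accessor methods, and portScanPattern are assumed names, not from the question):
DataStream<LogEvent> logs = ...; // hypothetical source

// Brute force rule: evaluate the pattern independently per username.
KeyedStream<LogEvent, String> byUser = logs.keyBy(event -> event.getUsername());

// Port scanning rule: evaluate the pattern independently per source IP.
KeyedStream<LogEvent, String> byIp = logs.keyBy(event -> event.getIp());

// A pattern applied to a keyed stream is matched separately for each key.
PatternStream<LogEvent> matches = CEP.pattern(byIp, portScanPattern);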
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's GitHub account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table, and only generating alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
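Here is a minimal sketch of that kind of de-duplicator, assuming the alerts are keyed by IP address and using an arbitrary one-hour suppression interval (the Alert type and its getIp() accessor are assumptions):
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits the first alert per key, suppresses duplicates for SUPPRESSION_INTERVAL,
// then clears the state so alerts for that key are re-enabled.
public class AlertDeduplicator extends KeyedProcessFunction<String, Alert, Alert> {

    private static final long SUPPRESSION_INTERVAL = 60 * 60 * 1000; // 1 hour, arbitrary

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void processElement(Alert alert, Context ctx, Collector<Alert> out) throws Exception {
        if (seen.value() == null) {
            out.collect(alert);              // first alert for this key: emit it
            seen.update(true);
            // schedule state cleanup so alerts are re-enabled later
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + SUPPRESSION_INTERVAL);
        }
        // otherwise: duplicate within the suppression window, drop it
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) {
        seen.clear();                        // re-enable alerts for this key
    }
}
It would be applied with something like alerts.keyBy(alert -> alert.getIp()).process(new AlertDeduplicator()).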
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
    .<Event>begin("distinctPorts")
    .where(/* iterative condition 1 */)
    .oneOrMore()
    .followedBy("end")
    .where(/* iterative condition 2 */)
    .within(Time.minutes(1));
The first iterative condition returns true if the event being added to the pattern has a port distinct from those of all the previously matched events. This is somewhat similar to the iterative condition example in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
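As a hedged sketch, those two conditions might look roughly like this (Event and getPort() are assumed; the two conditions would be plugged into the .where() placeholders in the pattern above):
import org.apache.flink.cep.pattern.conditions.IterativeCondition;

// Condition 1: accept an event into "distinctPorts" only if its port differs
// from the ports of all events matched so far under that pattern name.
IterativeCondition<Event> distinctPort = new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event event, Context<Event> ctx) throws Exception {
        for (Event seen : ctx.getEventsForPattern("distinctPorts")) {
            if (seen.getPort() == event.getPort()) {
                return false;
            }
        }
        return true;
    }
};

// Condition 2: fire only when 9 distinct ports have already matched and this
// event adds yet another distinct port, i.e. 10 distinct ports within the window.
IterativeCondition<Event> tenthDistinctPort = new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event event, Context<Event> ctx) throws Exception {
        int count = 0;
        for (Event seen : ctx.getEventsForPattern("distinctPorts")) {
            if (seen.getPort() == event.getPort()) {
                return false;                // not a new port
            }
            count++;
        }
        return count >= 9;
    }
};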
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.

How do I relate the metrics in Datadog with execution plan operators in Flink?

In my scenario, Flink is sending metrics to Datadog. The Datadog host map is as shown below (I have no idea why it is showing me latency here).
Flink metrics are sent to localhost. The issue arises when the flink-conf.yaml file is configured as follows:
# adding metrics
metrics.reporters: stsd , dghttp
metrics.reporter.stsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.stsd.host: localhost
metrics.reporter.stsd.port: 8125
# for datadog
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: xxx
metrics.reporter.dghttp.tags: host:localhost, job_id : jobA , tm_id : task1 , operator_name : operator1
metrics.scope.operator: numRecordsIn
metrics.scope.operator : numRecordsInPerSecond
metrics.scope.operator : numRecordsOut
metrics.scope.operator : numRecordsOutPerSecond
metrics.scope.operator : latency
The issue is that Datadog is showing 163 metrics that I don't understand, as I will explain in a moment.
I don't understand the metrics format in Datadog, as it shows me metrics that look something like this:
Now, as shown in the above image:
Latency is expressed in time
Number of events per second is events/sec
count is some value
So my question is: which metric is this?
Also, the execution plan of my job is something like this:
How do I relate the metrics in Datadog with execution plan operators in Flink?
I have read in the Flink 1.3.2 API that I can use tags. I have tried to use them in the flink-conf.yaml file, but I don't fully understand what purpose they serve here.
My ultimate goal is to find the latency and the number of records in and out per second at each operator.
There are a variety of issues here.
1. You've misconfigured the scope formats. (metrics.scope.operator)
For one, the configuration doesn't make sense, since you specify "metrics.scope.operator" multiple times; only the last config entry is honored.
Second, and more importantly, you have misunderstood what scope formats are used for.
Scope formats configure which context information (like the ID of the task) is included in the reported metric's name.
By setting it to a constant ("latency") you've told Flink to not include anything. As a result, the numRecordsIn metric for every operator is reported as "latency.numRecordsIn".
I suggest just removing your scope configuration.
2. You've misconfigured the Datadog Tags
I do not understand what you were trying to do with your tags configuration.
The tags configuration option can only be used to provide global tags, i.e. tags that are attached to every single metric, like "Flink".
By default, every metric that the Datadog reporter reports has tags attached to it for every available scope variable.
So, if you have an operator name A, then the numRecordsIn metric will be reported with a tag "operator_name:A".
Again, I would suggest just removing this configuration.
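Putting both suggestions together, the metrics-related part of flink-conf.yaml would shrink to roughly the following (keeping the two reporters from the question; the API key is elided as before):
metrics.reporters: stsd, dghttp
metrics.reporter.stsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.stsd.host: localhost
metrics.reporter.stsd.port: 8125
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: xxx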

Correct way to retrieve a single object from Realm database

I am absolutely loving Realm (0.92) in combination with Swift, but have a question about reading an object from the database. My goal is to retrieve a single object with a known, unique ID (which also happens to be the primary key).
All the documentation appears to be oriented around queries for multiple objects which are then filtered. In this case I know the object ID and, since it is known to be unique, would like to retrieve it directly.
My current approach is as follows:
Realm().objects(Book).filter("id == %#", prevBook.nextID).first
This seems heavy-handed. Documentation from prior versions suggests that there is a more direct way, but I can't seem to locate it in the documentation.
The problem with my current approach is that it is crashing with an exception on the following function:
public func filter(predicateFormat: String, _ args: CVarArgType...) -> Results<T>
The exception is mysteriously reported as:
EXC_BAD_ACCESS (code=1, address=0xedf)
Any suggestions are very welcome.
Anticipating one line of questioning: I have confirmed that replacing prevBook.nextID with a known, good ID does not solve the problem.
object(ofType:forPrimaryKey:) is what you're looking for: Realm().object(ofType: Book.self, forPrimaryKey: prevBook.nextID). There's no simpler way than filter().first if you need to search for the object by something other than the primary key.

BadArgumentError: _MultiQuery with cursors requires __key__ order in ndb

I can't understand what this error means, and apparently no one has ever gotten the same error on the internet:
BadArgumentError: _MultiQuery with cursors requires __key__ order
This happens here:
return SocialNotification.query().order(-SocialNotification.date).filter(SocialNotification.source_key.IN(nodes_list)).fetch_page(10)
The property source_key is obviously a key and nodes_list is a list of entity keys previously retrieved.
What I need is to find all the SocialNotifications that have a field source_key that match one of the keys in the list.
The error message tries to tell you that queries involving IN and cursors must be ordered by __key__ (which is the internal name for the key of the entity). (This is needed so that the results can be properly merged and made unique.) In this case you have to replace your .order() call with .order(SocialNotification._key).
It seems that this also happens when you filter for an inequality and try to fetch a page.
(e.g. MyModel.query(MyModel.prop != 'value').fetch_page(...)). This basically means (unless I missed something) that you can't fetch_page when using an inequality filter, because on one hand you need the sort to be on MyModel.prop, but on the other hand you need it to be on MyModel._key, which is hard :)
I found the answer here: https://developers.google.com/appengine/docs/python/ndb/queries#cursors
You can change your query to:
SocialNotification.query().order(-SocialNotification.date, SocialNotification.key).filter(SocialNotification.source_key.IN(nodes_list)).fetch_page(10)
in order to get this to work. Note that it seems to be slow (18 seconds) when nodes_list is large (1000 entities), at least on the Development server. I don't have a large amount of test data on a test server.
You need both the property you want to order on and the key:
.order(-SocialNotification.date, SocialNotification.key)
I had the same error when filtering without a group.
The error occurred every time my filter returned more than one result.
To fix it I actually had to add ordering by key.

EXCEEDED_ID_LIMIT: emptyRecycleBin id limit reached: 200

I'm just wondering if anyone else has seen this and if so, can you confirm that this is correct? The documentation claims, as you might expect, that 10,000 is the record limit for the system call:
Database.emptyRecycleBin(records);
not 200. Yet it's throwing an error at 200. The only thing I can think of is that this call occurs from within a batch Apex process.
It took a little over a week and me supplying a failing test case to Salesforce support, but the issue is now being reported as a Salesforce known issue, suggesting it may get addressed in the platform.
My workaround for now is to wrap the call in a Database.Batchable with the batch size set to 200.
This is the only reference that I could find to there being a limit of 200 on emptyRecycleBin(); I dare say that you are correct:
http://www.salesforce.com/us/developer/docs/api/Content/sforce_api_calls_emptyrecyclebin.htm
Adam, if you got shut down when attempting to log a case regarding this due to the whole Premier Support thing, you should definitely escalate your case, as it was handled incorrectly and SFDC needs to know about it. I had the exact same issue myself.
SOQL for loops may be a helpful option for working around this limit, as the 'for (Account[] accounts : [SELECT Id FROM Account WHERE IsDeleted = true ALL ROWS])' format provides batches of 200.
