How to handle errors in custom MapFunction correctly? - apache-flink

I have implemented MapFunction for my Apache Flink flow. It is parsing incoming elements and convert them to other format but sometimes error can appear (i.e. incoming data is not valid).
I see two possible ways how to handle it:
Ignore invalid elements but seems like I can't ignore errors because for any incoming element I must provide outgoing element.
Split incoming elements to valid and invalid but seems like I should use other function for this.
So, I have two questions:
How to handle errors correctly in my MapFunction?
How to implement such transformation functions correctly?

You could use a FlatMapFunction instead of a MapFunction. This would allow you to only emit an element if it is valid. The following shows an example implementation:
input.flatMap(new FlatMapFunction<String, Long>() {
#Override
public void flatMap(String input, Collector<Long> collector) throws Exception {
try {
Long value = Long.parseLong(input);
collector.collect(value);
} catch (NumberFormatException e) {
// ignore invalid data
}
}
});

This is to build on #Till Rohrmann's idea above. Adding this as an answer instead of a comment for better formatting.
I think one way to implement "split + select" could be to use a ProcessFunction with a SideOutput. My graph would look something like this:
Source --> ValidateProcessFunction ---good data--> UDF--->SinkToOutput
\
\---bad data----->SinkToErrorChannel
Would this work? Is there a better way?

Related

How does Flink treat timestamps within iterative loops?

How are timestamps treated within an iterative DataStream loop within Flink?
For example, here is an example of a simple iterative loop within Flink where the feedback loop is of a different type to the input stream:
DataStream<MyInput> inputStream = env.addSource(new MyInputSourceFunction());
IterativeStream.ConnectedIterativeStreams<MyInput, MyFeedback> iterativeStream = inputStream.iterate().withFeedbackType(MyFeedback.class);
// define an output tag so we can emit feedback objects via a side output
final OutputTag<MyFeedback> outputTag = new OutputTag<MyFeedback>("feedback-output"){};
// now do some processing
SingleOutputStreamOperator<MyOutput> combinedStreams = iterativeStream.process(new CoProcessFunction<MyInput, MyFeedback, MyOutput>() {
#Override
public void processElement1(MyInput value, Context ctx, Collector<MyOutput> out) throws Exception {
// do some processing of the stream of MyInput values
// emit MyOutput values downstream by calling out.collect()
out.collect(someInstanceOfMyOutput);
}
#Override
public void processElement2(MyFeedback value, Context ctx, Collector<MyOutput> out) throws Exception {
// do some more processing on the feedback classes
// emit feedback items
ctx.output(outputTag, someInstanceOfMyFeedback);
}
});
iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
My questions revolve around how does Flink use timestamps within a feedback loop:
Within the ConnectedIterativeStreams, how does Flink treat ordering of the input objects across the streams of regular inputs and feedback objects? If I emit an object into the feedback loop, when will it be seen by the head of the loop with respect to the regular stream of input objects?
How does the behaviour change when using event time processing?
AFAICT, Flink doesn't provide any guarantees on the ordering of input objects. I've run into this when trying to use iterations for a clustering algorithm in Flink, where the centroid updates don't get processed in a timely manner. The only solution I found was to essentially create a single (unioned) stream of the incoming events and the centroid updates, versus using a co-stream.
FYI there's this proposal to address some of the short-comings of iterations.

How to handle exception properly in flink operator?

If I throw runtime exception in flink operator, how it will be handled?
I just want to ignore this exception and continue handle the stream, but I do not know the side effect if I just ignore them. Will this exception stop the whole data stream?
If one of your operators will throw an exception the entire job will fail. It doesn't differ that much from a normal application: If an exception is not handled by anybody the application will fail.
Think of this this way: You throw an exception if you don't know how to handle a certain situation - this somehow indicates whoever called me should take care of this. At least I am not aware of any way how you could tell Flink: Please ignore my exceptions.
My proposal: Handle the exceptions within your operators. This might mean you have to change the type of the operator (or at least the return type).
E.g.
case class MyMapper() extends MapFunction[Double, Double] {
override def map(in : Double) : Double {
try {
1/in
}
catch {
case e : java.lang.ArithmeticException:
throw new RuntimeException(e)
}
}
}
Might become a FlatMap operator just returning nothing. Or you change the return type of this operator to return an Option of Double instead of a Double.
Hope that helps

How to use Collections.binarySearch() in a CodenameOne project

I am used to being able to perform a binary search of a sorted list of, say, Strings or Integers, with code along the lines of:
Vector<String> vstr = new Vector<String>();
// etc...
int index = Collections.binarySearch (vstr, "abcd");
I'm not clear on how codenameone handles standard java methods and classes, but it looks like this could be fixed easily if classes like Integer and String (or the codenameone versions of these) implemented the Comparable interface.
Edit: I now see that code along the lines of the following will do the job.
int index = Collections.binarySearch(vstr, "abcd", new Comparator<String>() {
#Override
public int compare(String object1, String object2) {
return object1.compareTo(object2);
}
});
Adding the Comparable interface (to the various primitive "wrappers") would also would also make it easier to use Collections.sort (another very useful method :-))
You can also sort with a comparator but I agree, this is one of the important enhancements we need to provide in the native VM's on the various platforms personally this is my biggest peeve in our current VM.
Can you file an RFE on that and mention it as a comment in the Number issue?
If we are doing that change might as well do both.

multicast to a String collection

Is there an easy may to send a message to many endpoints, defined in a List<String>?
For exemple, I would like to do :
from("file://c:/inbox").to(myStringList);
Yes you will need to turn that list into an array as shown. Though doing this in pure Java is a bit ugly as the code shows:
String[] s = myStringList.toArray(new String[myStringList.size()]);
from("file://c:/inbox").to(s);

Retrieving Specific Active Directory Record Attributes using C#

I've been asked to set up a process which monitors the active directory, specifically certain accounts, to check that they are not locked so that should this happen, the support team can get an early warning.
I've found some code to get me started which basically sets up requests and adds them to a notification queue. This event is then assigned to a change event and has an ObjectChangedEventArgs object passed to it.
Currently, it iterates through the attributes and writes them to a text file, as so:
private static void NotifierObjectChanged(object sender,
ObjectChangedEventArgs e)
{
if (e.ResultEntry.Attributes.AttributeNames == null)
{
return;
}
// write the data for the user to a text file...
using (var file = new StreamWriter(#"C:\Temp\UserDataLog.txt", true))
{
file.WriteLine("{0} {1}", DateTime.UtcNow.ToShortDateString(), DateTime.UtcNow.ToShortTimeString());
foreach (string attrib in e.ResultEntry.Attributes.AttributeNames)
{
foreach (object item in e.ResultEntry.Attributes[attrib].GetValues(typeof(string)))
{
file.WriteLine("{0}: {1}", attrib, item);
}
}
}
}
What I'd like is to check the object and if a specific field, such as name, is a specific value, then check to see if the IsAccountLocked attribute is True, otherwise skip the record and wait until the next notification comes in. I'm struggling how to access specific attributes of the ResultEntry without having to iterate through them all.
I hope this makes sense - please ask if I can provide any additional information.
Thanks
Martin
This could get gnarly depending upon your exact business requirements. If you want to talk in more detail ping me offline and I'm happy to help over email/phone/IM.
So the first thing I'd note is that depending upon what the query looks like before this, this could be quite expensive or error prone (ie missing results). This worries me somewhat as most sample code out there gets this wrong. :) How are you getting things that have changed? While this sounds simple, this is actually a somewhat tricky question in directory land, given the semantics supported by AD and the fact that it is a multi-master system where writes happen all over the place (and replicate in after the fact).
Other variables would be things like how often you're going to run this, how large the data set could be in AD, and so on.
AD has some APIs built to help you here (the big one that comes to mind is called DirSync) but this can be somewhat complicated if you haven't used it before. This is where the "ping me offline" part comes in.
To your exact question, I'm assuming your result is actually a SearchResultEntry (if not I can revise, tell me what you have in hand). If that is the case then you'll find an Attributes field hanging off of that guy, and from there there is AttributeNames and Values. I think you'll see how it works from there if you have Values in hand, for example:
foreach (var attr in sre.Attributes.Values)
{
var da = (DirectoryAttribute)attr;
Console.WriteLine(da.Name);
foreach (var val in da.GetValues(typeof(byte[])))
{
// Handle a byte[] val ...
}
}
As I said, if you have something other than a SearchResultEntry in hand, let us know and I can revise the code sample.

Resources