Parse string in json format from Kafka using Flink - apache-flink

What I want to do is read a string in JSON format, e.g.
{"a":1, "b":2}
using Flink and then extract a specific value by its key, say the value 1.
I am referring to this page: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/kafka.html
What I have done is:
val params = ParameterTool.fromArgs(args)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val kafkaConsumer = new FlinkKafkaConsumer010(
  params.getRequired("input-topic"),
  new JSONKeyValueDeserializationSchema(false),
  params.getProperties
)
val messageStream = env.addSource(kafkaConsumer)
But I am not quite sure how to proceed from there. The link above says I can use objectNode.get("field").as(Int/String/...)() to extract a specific value by key, but how exactly can I do that?
Or is there a completely different way to achieve what I want?
Thanks!

Apply a transformation to the data from Kafka like this:
messageStream.map(new MapFunction<ObjectNode, Object>() {
    @Override
    public Object map(ObjectNode value) throws Exception {
        return value.get("field").as(...);
    }
});
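As a minimal sketch (Java, in the same style as the answer above): with JSONKeyValueDeserializationSchema(false), each record arrives as an ObjectNode of the form {"key": ..., "value": ...}, so the JSON payload from the question sits under "value". The key "a" is just the example from the question, and depending on the Flink version ObjectNode may come from the shaded Jackson package.
messageStream.map(new MapFunction<ObjectNode, Integer>() {
    @Override
    public Integer map(ObjectNode record) throws Exception {
        // "value" holds the deserialized JSON; "a" is the example key from the question
        return record.get("value").get("a").asInt();
    }
});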

Related

Proper way to assign watermark with DataStreamSource<List<T>> using Flink

I have JSONArray data continuously produced to a Kafka topic, and I want to process the records with the EventTime characteristic. To reach this goal, I have to assign a watermark to each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is to consume data as DataStreamSource<List<MockData>>, then iterate over the List and collect each object downstream with an anonymous ProcessFunction, and finally assign watermarks to this downstream.
The major code shows below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
    .process(new ProcessFunction<List<MockData>, MockData>() {
        @Override
        public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
                throws Exception {
            value.forEach(mockData -> out.collect(mockData));
        }
    });
convertToPojo.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(MockData element) {
            return element.getTimestamp();
        }
    });
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
    .keyBy("country")
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
    .process(new FlinkEventTimeCountFunction())
    .name("count elements");
The code seems all right and runs without error, but the ProcessWindowFunction is never triggered. I tracked the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
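A minimal sketch of that reordering, keeping the names from the question (MockData, KafkaSource and FlinkEventTimeCountFunction are the asker's own classes):
SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(MockData element) {
                return element.getTimestamp();
            }
        });
// The windowing is applied to the stream that carries timestamps and watermarks.
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
    .keyBy("country")
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
    .process(new FlinkEventTimeCountFunction())
    .name("count elements");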

How to get state for multiple keyBy in flink using queryable state client?

I am using Flink 1.4.2 and I have one scenario in which I need to use two keys.
For example:
KeyedStream<UsageStatistics, Tuple> keyedStream = stream.keyBy("clusterId", "ssid");
usageCounts = keyedStream.process(new CustomProcessFunction(windowSize,queryableStateName));
The ValueStateDescriptor would be:
ValueStateDescriptor<SsidTotalUsage> descriptor = new ValueStateDescriptor<>(queryableStateName, SsidTotalUsage.class);
descriptor.setQueryable(queryableStateName);
Can anyone please suggest how to get state for multiple keys in Flink using the queryable state client?
The QueryableStateClient call below works well for a single key, 'clusterId'.
kvState = queryableStateClient.getKvState(JobID.fromHexString(jobId), queryableStateName, clusterId, BasicTypeInfo.STRING_TYPE_INFO, descriptor);
What should the type info be for multiple keys? Any suggestion, example, or reference related to this would be very helpful.
I found the solution.
I supplied TypeInformation (built from a TypeHint) in the ValueStateDescriptor.
In Flink Job:
TypeInformation<SsidTotalUsage> typeInformation = TypeInformation.of(new TypeHint<SsidTotalUsage>() {});
ValueStateDescriptor<SsidTotalUsage> descriptor = new ValueStateDescriptor(queryableStateName, typeInformation);
On Client Side:
ValueStateDescriptor<SsidTotalUsage> descriptor = new ValueStateDescriptor(queryableStateName, typeInformation);
I have two keys, so I used the Tuple2 class and set the values of my keys as below.
Note: if you have more than two keys, use the Tuple3, Tuple4, ... classes according to the number of keys.
Tuple2<String, String> tuple = new Tuple2<>();
tuple.f0 = clusterId;
tuple.f1 = ssid;
Then I provided a TypeHint for the key type.
TypeHint<Tuple2<String, String>> typeHint = new TypeHint<Tuple2<String, String>>() {};
CompletableFuture<ValueState<SsidTotalUsage>> kvState = queryableStateClient.getKvState(JobID.fromHexString(jobId), queryableStateName, tuple, typeHint, descriptor);
In the above code, the getKvState call returns a future whose result is an ImmutableValueState, so I need to get my POJO like below.
ImmutableValueState<SsidTotalUsage> state = (ImmutableValueState<SsidTotalUsage>) kvState.get();
totalUsage = state.value();
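Putting the client-side pieces together, a rough sketch (exception handling omitted; queryableStateClient, queryableStateName, jobId, clusterId, ssid and SsidTotalUsage are the asker's own names):
// The key must match the composite key produced by keyBy("clusterId", "ssid"),
// i.e. a Tuple2<String, String>.
Tuple2<String, String> key = Tuple2.of(clusterId, ssid);
ValueStateDescriptor<SsidTotalUsage> descriptor = new ValueStateDescriptor<>(
    queryableStateName, TypeInformation.of(new TypeHint<SsidTotalUsage>() {}));
CompletableFuture<ValueState<SsidTotalUsage>> future = queryableStateClient.getKvState(
    JobID.fromHexString(jobId), queryableStateName, key,
    new TypeHint<Tuple2<String, String>>() {}, descriptor);
SsidTotalUsage totalUsage = future.get().value();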

Creating & Setting a Map into context through SpringEl

As the SpEL documentation indicates, there is expression syntax for creating a list, which then lets me set it into the context like this:
List numbers = (List) parser.parseExpression("map['innermap']['newProperty']={1,2,3,4}").getValue(context);
However, I am not able to find a way of doing the same thing for a Map, nor can I find it in the documentation.
Is there a shorthand way of creating a map and then setting it into the context? If not, how can we go about it?
If possible, a code snippet would be helpful.
Thanks in advance.
It's now possible (since 4.1, I think):
{key:value, key:value}
http://docs.spring.io/spring/docs/current/spring-framework-reference/html/expressions.html#expressions-inline-maps
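A minimal example of that inline-map syntax, loosely following the Spring reference (the names here are illustrative):
// uses org.springframework.expression.spel.standard.SpelExpressionParser and java.util.Map
ExpressionParser parser = new SpelExpressionParser();
// Inline map literal (Spring 4.1+); simple keys do not need quoting.
Map<?, ?> inventor = (Map<?, ?>) parser
    .parseExpression("{name: 'Nikola', dob: '10-July-1856'}")
    .getValue();
inventor.get("name"); // "Nikola"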
No, it isn't possible yet: https://jira.spring.io/browse/SPR-9472
But you can do it with some util method, which should be registered as SpEL-function:
parser.parseExpression("#inlineMap('key1: value1, key2:' + value2)");
where you have to parse the String argument into a Map.
UPDATE
Please, read this paragraph: http://docs.spring.io/spring/docs/current/spring-framework-reference/html/expressions.html#expressions-ref-functions.
At a high level it should look like this:
public abstract class StringUtils {
    public static Map<String, Object> inlineMap(String input) {
        // Code to parse the 'input' string and build a Map goes here
    }
}
context.registerFunction("inlineMap",
    StringUtils.class.getDeclaredMethod("inlineMap", new Class[] { String.class }));
parser.parseExpression("#inlineMap('key1: value1, key2:' + value2)")
    .getValue(context, rootObject);
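A rough sketch of what the parsing inside inlineMap could look like (deliberately naive: no escaping, quoting, or nested maps; purely illustrative):
public static Map<String, Object> inlineMap(String input) {
    Map<String, Object> result = new LinkedHashMap<>();   // java.util.LinkedHashMap
    // Naive parsing: split "key1: value1, key2: value2" on commas, then on the first colon.
    for (String pair : input.split(",")) {
        String[] kv = pair.split(":", 2);
        result.put(kv[0].trim(), kv[1].trim());
    }
    return result;
}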

Storing JSON document with AppEngine

I'm trying to store a JSON document in the AppEngine datastore, using Objectify as the persistence layer. To be able to query for document values, instead of just inserting the whole document as a String field, I created a MapEntity which looks like this:
#Entity(name="Map")
public class MapEntity {
#Id
private Long id;
private Map<String,String> field;
// Code omitted
}
The reason is that, when eventually "unrolled", every key-value pair in the JSON document can be represented as a Map entry.
Example:
String subText = "{\"first\": 111, \"second\": [2, 2, 2], \"third\": 333}";
String jsonText = "{\"first\": 123, \"second\": [4, 5, 6], \"third\": 789, \"fourth\":"
+ subText + "}";
I will have the map fields stored in the datastore:
KEY VALUE
field.first => 123
field.second => [4,5,6]
field.third => 789
field.fourth-first => 111
field.fourth-second => [2,2,2]
field.fourth-third => 333
My parse() method parses the JSON document using the JSON.Simple library and then recurses into nested objects:
private MapEntity parse(String root, MapEntity entity, Map json) {
    Iterator iter = json.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry entry = (Map.Entry) iter.next();
        if (entry.getValue() instanceof Map) {
            entity = parse((String) entry.getKey() + "-", entity, (Map) entry.getValue());
            System.out.println("Map instance");
        } else {
            entity.setField(root + String.valueOf(entry.getKey()), String.valueOf(entry.getValue()));
        }
    }
    return entity;
}
My app works like this:
MapEntity jsonEntity = new MapEntity();
Map json = null;
json = (Map) parser.parse(jsonText, containerFactory); // JSON.Simple parser
jsonEntity = parse("", jsonEntity, json);
The problems I encounter are:
I can't use the "." dot in the Map key field, so I have to use "-" instead.
Also, my approach to storing the JSON document is not very efficient.
If your JSON follows a strict format, you'd probably be better off constructing a class to represent your data format and serializing directly to and from that class using a library like Jackson (see the sketch after this list). You can use that class directly as your entity class in Objectify, but whether you want to do so depends on whether you want to:
Store and expose the exact same set of data
Tightly couple your storage and JSON representations
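For the first option, a minimal sketch with Jackson (the Document class and its fields are made up to mirror the example JSON from the question; your real schema may differ):
// uses com.fasterxml.jackson.databind.ObjectMapper and java.util.List
public class Document {
    public int first;
    public List<Integer> second;
    public int third;
    public Document fourth;   // nested object of the same shape; null when absent
}

ObjectMapper mapper = new ObjectMapper();
Document doc = mapper.readValue(jsonText, Document.class);  // JSON -> POJO
String back = mapper.writeValueAsString(doc);               // POJO -> JSON
You can then either persist Document directly with Objectify or map it onto your entity, depending on how tightly you want to couple the two representations.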
You could use JSONObject as a replacement for your MapEntity and store the JSON in Google App Engine as a string using the toString() method. Upon retrieval you could simply restore the JSONObject using the appropriate constructor. This, of course, limits your ability to index properties in App Engine and query against them.
If you want Objectify to do this for you, you could register a Translator to take care of calling the toString() and reconstruction.
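As a rough illustration of that round-trip, assuming org.json's JSONObject (a json-simple JSONObject would need a parser call instead of the String constructor):
// Store: keep the whole document as a single String property on the entity.
JSONObject json = new JSONObject(jsonText);
String stored = json.toString();
// Retrieve: rebuild the JSONObject from the stored String.
JSONObject restored = new JSONObject(stored);
int first = restored.getInt("first");   // 123 in the question's example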

Write and read file line by line with Camel

I would like to write byte arrays to a file with Camel. But, in order to get my arrays back, I want to write them line by line, or with another separator.
How can I do that with Camel?
from(somewhere)
    .process(new Processor() {
        @Override
        public void process(final Exchange exchange) throws Exception {
            final MyObject body = exchange.getIn().getBody(MyObject.class);
            byte[] serializedObject = MySerializer.serialize(body);
            exchange.getOut().setBody(serializedObject);
            exchange.getOut().setHeader(Exchange.FILE_NAME, "filename");
        }
    }).to("file://filepath?fileExist=Append&autoCreate=true");
Or does anyone have another way to get them back?
PS: I need to have only one file, otherwise it would have been too easy...
EDIT:
I successfully write my file line by line with the out.writeObject method (thanks to Petter), and I can read the objects back with:
InputStream file = new FileInputStream(FILENAME);
InputStream buffer = new BufferedInputStream(file);
input = new ObjectInputStream(buffer);
Object obj = null;
while ((obj = input.readObject()) != null) {
// Do something
}
But I am not able to split and read them with Camel. Do you have any idea how to read them back with Camel?
It depends on what your serialized object looks like, since you seem to have your own serializer. Is it standard Java binary serialization, like this?
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutput out = new ObjectOutputStream(bos);
out.writeObject(obj);
return bos.toByteArray();
It probably won't be such a great idea to use text-based separators like \n.
Can't you serialize into some text format instead? Camel has several easy-to-use data formats (http://camel.apache.org/data-format.html). XStream, for instance, takes a line of code or so to create XML from your objects; then it's no big deal to split the file into several XML parts and read them back with XStream.
In your example, if you really want a separator, why don't you just append it to the byte[]? Copy the array to a new, bigger byte[] and insert some unique sequence at the end, as sketched below.
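A small sketch of that last suggestion, appending a separator to each serialized array before it is written (the separator bytes here are an arbitrary example and must not occur inside the payload):
// Illustrative only: append a marker so the records can be split again later.
private static final byte[] SEPARATOR = {0x1E, 0x1E, 0x1E, 0x1E};

private static byte[] withSeparator(byte[] serializedObject) {
    byte[] result = new byte[serializedObject.length + SEPARATOR.length];
    System.arraycopy(serializedObject, 0, result, 0, serializedObject.length);
    System.arraycopy(SEPARATOR, 0, result, serializedObject.length, SEPARATOR.length);
    return result;
}
In the route above, exchange.getOut().setBody(withSeparator(serializedObject)) would then write the marker after each record; the caveat still applies, since a binary payload can contain any byte sequence.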