Flink DataSet Tuple Values not coming as expected - apache-flink

I have a DataSet<Tuple3<String,String,Double>> values which contains the following data:
<Vijaya,Chocolate,5>
<Vijaya,Chips,10>
<Rahul,Chocolate,2>
<Rahul,Chips,8>
I want a DataSet<Tuple5<String,String,Double,String,Double>> values1 as follows:
<Vijaya,Chocolate,5,Chips,10>
<Rahul,Chocolate,2,Chips,8>
My code looks like following:
DataSet<Tuple5<String, String, Double, String, Double>> values1 = values.fullOuterJoin(values)
    .where(0)
    .equalTo(0)
    .with(new JoinFunction<Tuple3<String, String, Double>, Tuple3<String, String, Double>, Tuple5<String, String, Double, String, Double>>() {
        private static final long serialVersionUID = 1L;

        public Tuple5<String, String, Double, String, Double> join(Tuple3<String, String, Double> first, Tuple3<String, String, Double> second) {
            return new Tuple5<String, String, Double, String, Double>(first.f0, first.f1, first.f2, second.f1, second.f2);
        }
    })
    .distinct(1, 3)
    .distinct(1);
In the above code I tried doing a self join. I want the output in that particular format, but I am unable to get it.
How can I do this?
Please help.

Since you don't want the output to repeat the same item, you can use a flat-join, in which you output only those records whose value in the 2nd position differs from the value in the 4th position. If you also want only "Chocolate" in the 2nd position, that can be checked inside the FlatJoinFunction as well. Below is the link to Flink's documentation on the flat-join, followed by a sketch of this approach.
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/batch/dataset_transformations.html#join-with-flat-join-function
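A minimal sketch of such a FlatJoinFunction (an assumption here: each pair is emitted exactly once by comparing the two item names, which also drops the self-paired records):
DataSet<Tuple5<String, String, Double, String, Double>> values1 = values.join(values)
    .where(0)
    .equalTo(0)
    .with(new FlatJoinFunction<Tuple3<String, String, Double>, Tuple3<String, String, Double>, Tuple5<String, String, Double, String, Double>>() {
        @Override
        public void join(Tuple3<String, String, Double> first, Tuple3<String, String, Double> second,
                         Collector<Tuple5<String, String, Double, String, Double>> out) {
            // Emit each pair only once: keep the ordering where the first item name
            // sorts after the second ("Chocolate" > "Chips"), matching the sample output.
            if (first.f1.compareTo(second.f1) > 0) {
                out.collect(new Tuple5<>(first.f0, first.f1, first.f2, second.f1, second.f2));
            }
        }
    });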
Approach using GroupReduceFunction:
values
    .groupBy(0)
    .reduceGroup(new GroupReduceFunction<Tuple3<String, String, Double>, Tuple2<String, String>>() {
        @Override
        public void reduce(Iterable<Tuple3<String, String, Double>> in, Collector<Tuple2<String, String>> out) {
            StringBuilder output = new StringBuilder();
            String name = null;
            for (Tuple3<String, String, Double> item : in) {
                name = item.f0;
                output.append(item.f1 + "," + item.f2 + ",");
            }
            // Emits e.g. (Vijaya, "Chocolate,5.0,Chips,10.0,"); note the trailing comma.
            out.collect(new Tuple2<String, String>(name, output.toString()));
        }
    });
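Note that this emits a Tuple2 with all of a person's items concatenated into one string rather than the Tuple5 asked for; if exactly two items per person are guaranteed, the reduce function could instead collect the two item/amount pairs and emit the Tuple5 directly.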

Related

How to get comma separated value using JPA in Database

I'm currently having a problem reading my database data, because the fileIdx column holds comma-separated values.
The problem is:
Caused by: java.sql.SQLException: Out of range value for column 'fileidx7_4_' : value 322,323
at org.mariadb.jdbc.internal.com.read.resultset.rowprotocol.TextRowProtocol.getInternalLong(TextRowProtocol.java:349)
at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.getLong(SelectResultSet.java:1001)
at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.getLong(SelectResultSet.java:995)
at com.zaxxer.hikari.pool.HikariProxyResultSet.getLong(HikariProxyResultSet.java)
at org.hibernate.type.descriptor.sql.BigIntTypeDescriptor$2.doExtract(BigIntTypeDescriptor.java:63)
at org.hibernate.type.descriptor.sql.BasicExtractor.extract(BasicExtractor.java:47)
at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:257)
at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:253)
at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:243)
at org.hibernate.type.AbstractStandardBasicType.hydrate(AbstractStandardBasicType.java:329)
at org.hibernate.type.ManyToOneType.hydrate(ManyToOneType.java:184)
at org.hibernate.persister.entity.AbstractEntityPersister.hydrate(AbstractEntityPersister.java:3088)
at org.hibernate.loader.Loader.loadFromResultSet(Loader.java:1907)
at org.hibernate.loader.Loader.hydrateEntityState(Loader.java:1835)
at org.hibernate.loader.Loader.instanceNotYetLoaded(Loader.java:1808)
at org.hibernate.loader.Loader.getRow(Loader.java:1660)
at org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:745)
at org.hibernate.loader.Loader.getRowsFromResultSet(Loader.java:1044)
at org.hibernate.loader.Loader.processResultSet(Loader.java:995)
at org.hibernate.loader.Loader.doQuery(Loader.java:964)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:350)
at org.hibernate.loader.Loader.doList(Loader.java:2887)
... 108 more
This is my entity:
@Entity
public class Board {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long idx;
    private Long code;
    private Long adminIdx;
    private String write;
    private String title;
    private String content;
    private String fileIdx;
    private String deleted;
    @Column(name = "sDate")
    private Long sdate;
    @Column(name = "eDate")
    private Long edate;
    private Long regdate;
    // getters and setters omitted
}
This is my JpaRepository method, which uses JPQL:
@Query(value = "select b from Board b")
List<Board> getTest();
This is my test code:
@Test
public void test() {
    List<Board> shopBoardEntity = boardRepository.getTest();
    shopBoardEntity.forEach(o -> {
        System.out.println("dddd : " + o.getTitle());
    });
}
No matter how hard I try, I can't figure it out.
Does anyone have an idea?

To convert object node to json node

Here the DataStream returns the key/value pair as an ObjectNode. I need the key and value directly, not wrapped in an object, because I need to group the values by key.
DataStream<ObjectNode> stream = env
    .addSource(new FlinkKafkaConsumer<>("test5", new JSONKeyValueDeserializationSchema(false), properties));
// stream.keyBy("record1").print();
When I call stream.keyBy("record1").print(); it throws:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: This type (GenericType<org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode>) cannot be used as key.
at org.apache.flink.api.common.operators.Keys$ExpressionKeys.<init>(Keys.java:330)
at org.apache.flink.streaming.api.datastream.DataStream.keyBy(DataStream.java:337)
at ReadFromKafka.main(ReadFromKafka.java:27)
David Anderson's response is correct. As an addition, you can simply create a KeySelector that extracts the key as a String. It could look like this:
public class JsonKeySelector implements KeySelector<ObjectNode, String> {
    @Override
    public String getKey(ObjectNode jsonNodes) throws Exception {
        return jsonNodes.get("key").asText();
    }
}
This obviously assumes that the key is supposed to be a String.
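Applied to the stream above, usage might look like this (a sketch; the "key" field is where the JSONKeyValueDeserializationSchema places the Kafka message key):
KeyedStream<ObjectNode, String> keyed = stream.keyBy(new JsonKeySelector());
keyed.print();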
There are several ways to specify the key selector in a Flink keyBy. For example, if you have a POJO of type Event with the String key in a field named "id", any of these will work:
stream.keyBy("id")
stream.keyBy(event -> event.id)
stream.keyBy(
new KeySelector<Event, String>() {
#Override
public String getKey(Event event) throws Exception {
return event.id;
}
}
)
So long as you can compute the key from the object in a deterministic way, you can make this work.

Flink map tuple to string

Hi, I have a tuple Tuple2<String, Integer> that I want to convert to a string and then send to Kafka.
I'm trying to figure out a way to iterate over the tuple and create one string from it, so if I have N elements in my tuple, I want to create a single string that contains all of them.
I tried flatMap, but I'm getting a new string for each element of the tuple.
SingleOutputStreamOperator<String> s = t.flatMap(new FlatMapFunction<Tuple2<String, Integer>, String>() {
    @Override
    public void flatMap(Tuple2<String, Integer> stringIntegerTuple2, Collector<String> collector) throws Exception {
        collector.collect(stringIntegerTuple2.f0 + stringIntegerTuple2.f1);
    }
});
What is the correct way to create one string out of a tuple?
You can just override the .toString() method of a tuple in a custom subclass and use that, like this:
import org.apache.flink.api.java.tuple.Tuple3;

public class CustomTuple3 extends Tuple3 {
    @Override
    public String toString() {
        return "measurment color=" + this.f0.toString() + " color=" + this.f1.toString() + " color=" + this.f2.toString();
    }
}
So now just use a CustomTuple3 object instead of a Tuple3 and when you populate it and call .toString() on it, it will output that formatted string.
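Alternatively (a sketch, not part of the original answer), a plain MapFunction can build the string without a custom tuple class; Tuple's getArity() and getField(int) make it work for any tuple size:
DataStream<String> s = t.map(new MapFunction<Tuple2<String, Integer>, String>() {
    @Override
    public String map(Tuple2<String, Integer> tuple) {
        // Concatenate every field of the tuple, comma-separated.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tuple.getArity(); i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append(tuple.getField(i).toString());
        }
        return sb.toString();
    }
});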

The begin and end time of windows

What would be the way to show the begin and end time of a window? Something like implementing user-defined windows?
I would like to know when a window begins and when it is evaluated, such that the output is:
quantity(WindowAll Sum), window_start_time, window_end_time
12, 1:13:21, 1:13:41
6, 1:13:41, 1:15:01
Found the answer: TimeWindow has getStart() and getEnd().
Example usage:
public static class SumAllWindow implements AllWindowFunction<Tuple2<String, Integer>,
        Tuple3<Integer, String, String>, TimeWindow> {

    // Joda-Time formatter for rendering the window boundaries.
    private static transient DateTimeFormatter timeFormatter =
        DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SS").withLocale(Locale.GERMAN)
            .withZone(DateTimeZone.forID("Europe/Berlin"));

    @Override
    public void apply(TimeWindow window, Iterable<Tuple2<String, Integer>> values,
                      Collector<Tuple3<Integer, String, String>> out) throws Exception {
        // getStart() and getEnd() return the window boundaries as epoch milliseconds.
        DateTime startTs = new DateTime(window.getStart(), DateTimeZone.forID("Europe/Berlin"));
        DateTime endTs = new DateTime(window.getEnd(), DateTimeZone.forID("Europe/Berlin"));
        int sum = 0;
        for (Tuple2<String, Integer> value : values) {
            sum += value.f1;
        }
        out.collect(new Tuple3<>(sum, startTs.toString(timeFormatter), endTs.toString(timeFormatter)));
    }
}
In main():
msgStream.timeWindowAll(Time.of(6, TimeUnit.SECONDS)).apply(new SumAllWindow()).print();

How to avoid repeated tuples in Flink slide window join?

For example, there are two streams. One is the advertisements shown to users; its tuples could be described as (advertiseId, shown timestamp). The other is the click stream: (advertiseId, clicked timestamp). We want to get a joined stream which includes every advertisement that is clicked by a user within 20 minutes after it is shown. My solution is to join the two streams on a SlidingTimeWindow, but the joined stream contains many repeated tuples. How can I get each joined tuple only once in the new stream?
stream1.join(stream2)
    .where(0)
    .equalTo(0)
    .window(SlidingTimeWindows.of(Time.of(30, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)))
Solution 1:
Let Flink support joining two streams on separate windows, as Spark Streaming does. In this case, put SlidingTimeWindows(21 min, 1 min) on the advertisement stream and TumblingTimeWindows(1 min) on the click stream, then join the two windowed streams.
The TumblingTimeWindows avoid duplicate records in the joined stream.
The 21-minute SlidingTimeWindows avoid missing legal clicks.
One issue is that there would be some illegal clicks (clicks after 20 minutes) in the joined stream; this can be fixed easily by adding a filter.
MultiWindowsJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
    new MultiWindowsJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams.where(keySelector)
    // note: the window sizes here are in seconds, not the minutes described above
    .window(SlidingTimeWindows.of(Time.of(21, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
    .equalTo(keySelector)
    .window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
    .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
        private static final long serialVersionUID = -3625150954096822268L;

        @Override
        public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
            return new Tuple3<>(first.f0, first.f1, second.f1);
        }
    });
joinedStream = joinedStream.filter(new FilterFunction<Tuple3<String, Long, Long>>() {
    private static final long serialVersionUID = -4325256210808325338L;

    @Override
    public boolean filter(Tuple3<String, Long, Long> value) throws Exception {
        // keep only clicks that happen after the impression and within the allowed interval
        return value.f1 < value.f2 && value.f1 + 20000 >= value.f2;
    }
});
Solution 2:
Flink supports joins without windows. A join operator implementing the TwoInputStreamOperator interface keeps two time-based buffers of the two streams and outputs one joined stream.
DataStream<Tuple2<String, Long>> advertisement = env
    .addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
    .map(new MapFunction<String, Tuple2<String, Long>>() {
        private static final long serialVersionUID = -6564495005753073342L;

        @Override
        public Tuple2<String, Long> map(String value) throws Exception {
            String[] splits = value.split(" ");
            return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
        }
    }).keyBy(keySelector).assignTimestamps(timestampExtractor1);

DataStream<Tuple2<String, Long>> click = env
    .addSource(new FlinkKafkaConsumer082<String>("click", new SimpleStringSchema(), properties))
    .map(new MapFunction<String, Tuple2<String, Long>>() {
        private static final long serialVersionUID = -6564495005753073342L;

        @Override
        public Tuple2<String, Long> map(String value) throws Exception {
            String[] splits = value.split(" ");
            return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
        }
    }).keyBy(keySelector).assignTimestamps(timestampExtractor2);
NoWindowJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
    new NoWindowJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams
    .where(keySelector)
    .buffer(Time.of(20, TimeUnit.SECONDS))
    .equalTo(keySelector)
    .buffer(Time.of(5, TimeUnit.SECONDS))
    .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
        private static final long serialVersionUID = -5075871109025215769L;

        @Override
        public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
            return new Tuple3<>(first.f0, first.f1, second.f1);
        }
    });
I implemented two new join operators based on the Flink streaming API's TwoInputTransformation. Please check Flink-stream-join. I will add more tests to this repository.
In your code, you defined an overlapping sliding window (the slide is smaller than the window size). If you don't want duplicates, you can define a non-overlapping window by specifying only the window size (the default slide is equal to the window size).
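For instance, a sketch based on the question's snippet, where only the window line changes:
stream1.join(stream2)
    .where(0)
    .equalTo(0)
    // tumbling windows do not overlap, so each matching pair is joined at most once
    .window(TumblingTimeWindows.of(Time.of(30, TimeUnit.MINUTES)))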
While searching for a solution to the same problem, I found the "interval join" very useful; it does not repeatedly output the same elements. This is the example from the Flink documentation:
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
With this, no explicit window has to be defined; instead, an interval is applied relative to each single element (see the diagram in the Flink documentation).
