Is JSONDeserializationSchema() deprecated in Flink? - apache-flink

I am new to Flink and am doing something very similar to the question below.
Cannot see message while sinking kafka stream and cannot see print message in flink 1.2
I am also trying to use JSONDeserializationSchema() as the deserializer for my Kafka input, which is a JSON message without a key.
But I found that JSONDeserializationSchema() is not present.
Please let me know if I am doing anything wrong.

JSONDeserializationSchema was removed in Flink 1.8, after having been deprecated earlier.
The recommended approach is to write a deserializer that implements DeserializationSchema<T>. Here's an example, which I've copied from the Flink Operations Playground:
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;

/**
 * A Kafka {@link DeserializationSchema} to deserialize {@link ClickEvent}s from JSON.
 */
public class ClickEventDeserializationSchema implements DeserializationSchema<ClickEvent> {

    private static final long serialVersionUID = 1L;

    private static final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public ClickEvent deserialize(byte[] message) throws IOException {
        return objectMapper.readValue(message, ClickEvent.class);
    }

    @Override
    public boolean isEndOfStream(ClickEvent nextElement) {
        return false;
    }

    @Override
    public TypeInformation<ClickEvent> getProducedType() {
        return TypeInformation.of(ClickEvent.class);
    }
}
For a Kafka producer you'll want to implement KafkaSerializationSchema<T>, and you'll find examples of that in that same project.
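For reference, here is a minimal sketch of such a serializer (not the playground's exact code): it assumes the same ClickEvent type as above, encodes the value as JSON with Jackson, and leaves the Kafka key unset; the topic name is supplied by the caller.
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickEventSerializationSchema implements KafkaSerializationSchema<ClickEvent> {

    private static final ObjectMapper objectMapper = new ObjectMapper();

    private final String topic;

    public ClickEventSerializationSchema(String topic) {
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(ClickEvent element, Long timestamp) {
        try {
            // The record has no key; the value is the JSON-encoded event.
            return new ProducerRecord<>(topic, objectMapper.writeValueAsBytes(element));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not serialize ClickEvent: " + element, e);
        }
    }
}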

To solve the problem of reading keyless JSON messages from Kafka, I used a case class and a JSON parser.
The following code defines a case class and parses the JSON fields using the Play API.
import play.api.libs.json.JsValue

object CustomerModel {

  def readElement(jsonElement: JsValue): Customer = {
    val id = (jsonElement \ "id").get.toString().toInt
    val name = (jsonElement \ "name").get.toString()
    Customer(id, name)
  }

  case class Customer(id: Int, name: String)
}
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val properties = new Properties()
  properties.setProperty("bootstrap.servers", "xxx.xxx.0.114:9092")
  properties.setProperty("group.id", "test-grp")

  val consumer = new FlinkKafkaConsumer[String]("customer", new SimpleStringSchema(), properties)
  val stream1 = env.addSource(consumer).rebalance

  val stream2: DataStream[Customer] = stream1.map { str =>
    val parsed = Try(CustomerModel.readElement(Json.parse(str)))
    parsed.getOrElse(Customer(0, parsed.toString))
  }

  stream2.print("stream2")
  env.execute("This is Kafka+Flink")
}
The Try wrapper lets you recover from the exception thrown while parsing the data:
it can either carry the exception into one of the fields (as done here) or simply return the case class object with given or default field values.
The sample output of the Code is:
stream2:1> Customer(1,"Thanh")
stream2:1> Customer(5,"Huy")
stream2:3> Customer(0,Failure(com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 0]))
I am not sure if it is the best approach but it is working for me as of now.

Related

Unable to Correctly Serialize RangeSet<Instant> with Flink Serialization System

I've implemented a RichFunction with the following type:
RichMapFunction<GeofenceEvent, OutputRangeSet>
the class OutputRangeSet has a field of type:
com.google.common.collect.RangeSet<Instant>
When this POJO is serialized using Kryo, I get null fields!
So far, I tried using a TypeInfoFactory<RangeSet>:
public class InstantRangeSetTypeInfo extends TypeInfoFactory<RangeSet<Instant>> {

    @Override
    public TypeInformation<RangeSet<Instant>> createTypeInfo(Type t, Map<String, TypeInformation<?>> genericParameters) {
        TypeInformation<RangeSet<Instant>> info = TypeInformation.of(new TypeHint<RangeSet<Instant>>() {});
        return info;
    }
}
which annotates my field:
public class OutputRangeSet implements Serializable {

    private String key;

    @TypeInfo(InstantRangeSetTypeInfo.class)
    private RangeSet<Instant> rangeSet;
}
Another solution (that doesn't work either) is registering a third-party serializer:
env.getConfig().registerTypeWithKryoSerializer(RangeSet.class, ProtobufSerializer.class);
You can get the github project here:
https://github.com/elarbikonta/tonl-events
When you run the test you can see (in debug) that the rangeSet beans I get from my RichFunction have null fields; see the test method com.tonl.apps.events.IsVehicleInZoneTest#operatorChronograph:
final RangeSet<Instant> rangeSet = resultList.get(0).getRangeSet(); // rangeSet.ranges == null!
Thanks for your help

Flink JDBC Sink part 2

I posted a question a few days back: Flink Jdbc sink
Now I am trying to use the JDBC sink provided by Flink.
I have written the code and it runs, but nothing gets saved to the DB and there are no exceptions. With the previous sink my code did not finish (which is what should happen, since it is a streaming app), but with the following code I get no errors and nothing is saved to the DB.
public class CompetitorPipeline implements Pipeline {

    private final StreamExecutionEnvironment streamEnv;
    private final ParameterTool parameter;
    private static final Logger LOG = LoggerFactory.getLogger(CompetitorPipeline.class);

    public CompetitorPipeline(StreamExecutionEnvironment streamEnv, ParameterTool parameter) {
        this.streamEnv = streamEnv;
        this.parameter = parameter;
    }

    @Override
    public KeyedStream<CompetitorConfig, String> start(ParameterTool parameter) throws Exception {
        CompetitorConfigChanges competitorConfigChanges = new CompetitorConfigChanges();
        KeyedStream<CompetitorConfig, String> competitorChangesStream = competitorConfigChanges.run(streamEnv, parameter);

        // Add the JDBC sink
        competitorChangesStream.addSink(JdbcSink.sink(
                "insert into competitor_config_universe(marketplace_id,merchant_id, competitor_name, comp_gl_product_group_desc," +
                        "category_code, competitor_type, namespace, qualifier, matching_type," +
                        "zip_region, zip_code, competitor_state, version_time, compConfigTombstoned, last_updated) values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
                (ps, t) -> {
                    ps.setInt(1, t.getMarketplaceId());
                    ps.setLong(2, t.getMerchantId());
                    ps.setString(3, t.getCompetitorName());
                    ps.setString(4, t.getCompGlProductGroupDesc());
                    ps.setString(5, t.getCategoryCode());
                    ps.setString(6, t.getCompetitorType());
                    ps.setString(7, t.getNamespace());
                    ps.setString(8, t.getQualifier());
                    ps.setString(9, t.getMatchingType());
                    ps.setString(10, t.getZipRegion());
                    ps.setString(11, t.getZipCode());
                    ps.setString(12, t.getCompetitorState());
                    ps.setTimestamp(13, Timestamp.valueOf(t.getVersionTime()));
                    ps.setBoolean(14, t.isCompConfigTombstoned());
                    ps.setTimestamp(15, new Timestamp(System.currentTimeMillis()));
                    System.out.println("sql" + ps);
                },
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:mysql://127.0.0.1:3306/database")
                        .withDriverName("com.mysql.cj.jdbc.Driver")
                        .withUsername("xyz")
                        .withPassword("xyz#")
                        .build()));

        return competitorChangesStream;
    }
}
You need to enable auto-commit mode for the JDBC sink.
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
        .withUrl("jdbc:mysql://127.0.0.1:3306/database;autocommit=true")
It looks like SimpleBatchStatementExecutor only works in auto-commit mode. If you need to commit and roll back batches, you have to write your own JdbcBatchStatementExecutor.
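For illustration only, here is a rough sketch of such an executor that commits each batch explicitly. It assumes the JdbcBatchStatementExecutor interface in flink-connector-jdbc exposes prepareStatements, addToBatch, executeBatch and closeStatements, and that the connection has auto-commit disabled; the interface is internal API, so verify it against your Flink version. Wiring it in also means using the lower-level JDBC output format rather than the JdbcSink.sink() convenience method.
import org.apache.flink.connector.jdbc.JdbcStatementBuilder;
import org.apache.flink.connector.jdbc.internal.executor.JdbcBatchStatementExecutor;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch: buffers records, writes them as one JDBC batch, and commits (or rolls back) explicitly.
public class CommittingStatementExecutor<T> implements JdbcBatchStatementExecutor<T> {

    private final String sql;
    private final JdbcStatementBuilder<T> statementBuilder;
    private final List<T> buffer = new ArrayList<>();

    private transient Connection connection;
    private transient PreparedStatement statement;

    public CommittingStatementExecutor(String sql, JdbcStatementBuilder<T> statementBuilder) {
        this.sql = sql;
        this.statementBuilder = statementBuilder;
    }

    @Override
    public void prepareStatements(Connection connection) throws SQLException {
        this.connection = connection;
        this.statement = connection.prepareStatement(sql);
    }

    @Override
    public void addToBatch(T record) {
        buffer.add(record);
    }

    @Override
    public void executeBatch() throws SQLException {
        if (buffer.isEmpty()) {
            return;
        }
        try {
            for (T record : buffer) {
                statementBuilder.accept(statement, record);
                statement.addBatch();
            }
            statement.executeBatch();
            connection.commit();   // explicit commit instead of relying on auto-commit
            buffer.clear();
        } catch (SQLException e) {
            connection.rollback(); // roll the whole batch back on failure
            throw e;
        }
    }

    @Override
    public void closeStatements() throws SQLException {
        if (statement != null) {
            statement.close();
        }
    }
}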
Have you tried including JdbcExecutionOptions?
dataStream.addSink(JdbcSink.sink(
        sql_statement,
        (statement, value) -> {
            /* Prepared Statement */
        },
        JdbcExecutionOptions.builder()
                .withBatchSize(5000)
                .withBatchIntervalMs(200)
                .withMaxRetries(2)
                .build(),
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:mysql://127.0.0.1:3306/database")
                .withDriverName("com.mysql.cj.jdbc.Driver")
                .withUsername("xyz")
                .withPassword("xyz#")
                .build()));

Flink DataStream[String] kafkaconsumer convert to Avro for Sink

Flink Streaming: I have a DataStream[String] from a Kafka consumer, whose contents are JSON:
stream = env
.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
I have to sink this stream using StreamingFileSink, which needs DataStream[GenericRecord]
val schema: Schema = ...
val input: DataStream[GenericRecord] = ...
val sink: StreamingFileSink[GenericRecord] = StreamingFileSink
.forBulkFormat(outputBasePath, AvroWriters.forGenericRecord(schema))
.build()
input.addSink(sink)
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/streamfile_sink.html
Question: how do I convert a DataStream[String] to a DataStream[GenericRecord] before sinking so that I can write Avro files?
Exception while converting the String stream to a GenericRecord stream:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: Task not serializable
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:408)
at org.apache.flink.api.scala.ClosureCleaner$.org$apache$flink$api$scala$ClosureCleaner$$clean(ClosureCleaner.scala:400)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:168)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.scalaClean(StreamExecutionEnvironment.scala:791)
at org.apache.flink.streaming.api.scala.DataStream.clean(DataStream.scala:1168)
at org.apache.flink.streaming.api.scala.DataStream.map(DataStream.scala:617)
at com.att.vdcs.StreamingJobKafkaFlink$.main(StreamingJobKafkaFlink.scala:128)
at com.att.vdcs.StreamingJobKafkaFlink.main(StreamingJobKafkaFlink.scala)
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:586)
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:406)
... 7 more
After initializing the schema in the mapper, I get a cast exception:
org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.util.Map
(The schema and msg were shown in a screenshot that is not reproduced here.)
I got past the cast exception by converting the Scala map to a Java map:
record.put(0,scala.collection.JavaConverters.mapAsJavaMapConverter(msg._1).asJava)
Now streaming is working well, except that extra escape characters are added:
,"body":"\"{\\\"hdr\\\":{\\\"mes
There are extra escape characters (\); it should look like:
,"body":"\"{\"hdr\":{\"mes
The extra escapes went away after changing toString to getAsString.
Now it is working as expected.
Next I need to try Snappy compression of the stream.
You need to transform your stream of Strings into a stream of GenericRecords, for example using a .map() function.
Example:
DataStream<String> strings = env.addSource( ... );

DataStream<GenericRecord> records = strings.map(inputStr -> {
    GenericData.Record rec = new GenericData.Record(schema);
    rec.put(0, inputStr);
    return rec;
});
Please note that using GenericRecord can lead to poor performance, because the schema needs to be serialized with each record over and over again.
It is better to generate an Avro POJO, as it won't need to ship the schema.
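As a rough sketch of that route, assuming a ClickEvent class generated from an .avsc schema (for example with the avro-maven-plugin), so that it implements SpecificRecord; the output path is a placeholder:
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// ClickEvent is assumed to be Avro-generated, so its schema ships with the class, not with each record.
DataStream<ClickEvent> events = ...; // e.g. strings.map(str -> parse the JSON into a ClickEvent)

StreamingFileSink<ClickEvent> sink = StreamingFileSink
        .forBulkFormat(new Path("/tmp/avro-out"), AvroWriters.forSpecificRecord(ClickEvent.class))
        .build();

events.addSink(sink);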
In Java, you should use a RichMapFunction to convert a DataStream<Map<String, Object>> to a DataStream<GenericRecord>, adding a transient Schema field used to build each GenericRecord. I don't know how to do this in Scala; it is just for reference.
DataStream<GenericRecord> records = maps.map(new RichMapFunction<Map<String, Object>, GenericRecord>() {

    private transient DatumWriter<IndexedRecord> datumWriter;

    /** Output stream to serialize records into a byte array. */
    private transient ByteArrayOutputStream arrayOutputStream;

    /** Low-level class for serialization of Avro values. */
    private transient Encoder encoder;

    /** Avro serialization schema. */
    private transient Schema schema;

    @Override
    public GenericRecord map(Map<String, Object> stringObjectMap) throws Exception {
        GenericRecord record = new GenericData.Record(schema);
        stringObjectMap.entrySet().forEach(entry -> record.put(entry.getKey(), entry.getValue()));
        return record;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.arrayOutputStream = new ByteArrayOutputStream();
        // Parse the schema first so that the writer below is created with a non-null schema.
        try {
            this.schema = new Schema.Parser().parse(avroSchemaString);
        } catch (SchemaParseException e) {
            throw new IllegalArgumentException("Could not parse Avro schema string.", e);
        }
        this.encoder = EncoderFactory.get().binaryEncoder(arrayOutputStream, null);
        this.datumWriter = new GenericDatumWriter<>(schema);
    }
});

final StreamingFileSink<GenericRecord> sink = StreamingFileSink
        .forBulkFormat(new Path("D:\\test"), AvroWriters.forGenericRecord(mongoSchema))
        .build();

records.addSink(sink);

To convert object node to json node

Here the DataStream returns the key/value pair as an ObjectNode. I need the key and value directly, not wrapped in an object, because I need to group the values based on the key.
DataStream<ObjectNode> stream = env
        .addSource(new FlinkKafkaConsumer<>("test5", new JSONKeyValueDeserializationSchema(false), properties));
// stream.keyBy("record1").print();
When I call stream.keyBy("record1").print(), it shows:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: This type (GenericType<org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode>) cannot be used as key.
at org.apache.flink.api.common.operators.Keys$ExpressionKeys.<init>(Keys.java:330)
at org.apache.flink.streaming.api.datastream.DataStream.keyBy(DataStream.java:337)
at ReadFromKafka.main(ReadFromKafka.java:27)
David Anderson's response is correct. As an addition, you can simply create a KeySelector that extracts the key as a String. It could look like this:
public class JsonKeySelector implements KeySelector<ObjectNode, String> {

    @Override
    public String getKey(ObjectNode jsonNodes) throws Exception {
        return jsonNodes.get("key").asText();
    }
}
This obviously assumes that the Key is supposed to be String.
There are several ways to specify the key selector in a Flink keyBy. For example, if you have a POJO of type Event with the String key in a field named "id", any of these will work:
stream.keyBy("id")

stream.keyBy(event -> event.id)

stream.keyBy(
    new KeySelector<Event, String>() {
        @Override
        public String getKey(Event event) throws Exception {
            return event.id;
        }
    }
)
So long as you can compute the key from the object in a deterministic way, you can make this work.
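For instance, a key computed deterministically from more than one field also works; a small sketch, where the type field on Event is assumed for illustration:
// Hypothetical composite key combining two fields of the Event POJO.
stream.keyBy(new KeySelector<Event, String>() {
    @Override
    public String getKey(Event event) {
        return event.id + "|" + event.type; // "type" is an assumed field
    }
});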

How to handle exception while parsing JSON in Flink

I am reading data from Kafka using Flink 1.4.2 and parsing it to ObjectNode using JSONDeserializationSchema. If an incoming record is not valid JSON, my Flink job fails. I would like to skip the broken record instead of failing the job.
FlinkKafkaConsumer010<ObjectNode> kafkaConsumer =
new FlinkKafkaConsumer010<>(TOPIC, new JSONDeserializationSchema(), consumerProperties);
DataStream<ObjectNode> messageStream = env.addSource(kafkaConsumer);
messageStream.print();
I am getting the following exception if the data in Kafka is not a valid JSON.
Job execution switched to status FAILING.
org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'This': was expecting ('true', 'false' or 'null')
at [Source: [B@4f522623; line: 1, column: 6]
Job execution switched to status FAILED.
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
The easiest solution is to implement your own DeserializationSchema and wrap JSONDeserializationSchema. You can then catch the exception and either ignore it or perform a custom action.
As suggested by @twalthr, I implemented my own DeserializationSchema by copying JSONDeserializationSchema and adding exception handling.
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.IOException;

public class CustomJSONDeserializationSchema extends AbstractDeserializationSchema<ObjectNode> {

    private ObjectMapper mapper;

    @Override
    public ObjectNode deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        ObjectNode objectNode;
        try {
            objectNode = mapper.readValue(message, ObjectNode.class);
        } catch (Exception e) {
            ObjectMapper errorMapper = new ObjectMapper();
            ObjectNode errorObjectNode = errorMapper.createObjectNode();
            errorObjectNode.put("jsonParseError", new String(message));
            objectNode = errorObjectNode;
        }
        return objectNode;
    }

    @Override
    public boolean isEndOfStream(ObjectNode nextElement) {
        return false;
    }
}
In my streaming job:
messageStream
    .filter((event) -> {
        if (event.has("jsonParseError")) {
            LOG.warn("JsonParseException was handled: " + event.get("jsonParseError").asText());
            return false;
        }
        return true;
    }).print();
Flink has since improved null-record handling for the FlinkKafkaConsumer.
There are two possible design choices when the DeserializationSchema encounters a corrupted message: it can either throw an IOException, which causes the pipeline to be restarted, or it can return null, in which case the Flink Kafka consumer silently skips the corrupted message.
For more details, you can see this link.
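For illustration, here is a minimal sketch of the return-null approach, assuming a plain Jackson ObjectMapper and the AbstractDeserializationSchema base class; records that fail to parse are simply skipped by the consumer.
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.IOException;

public class SkippingJsonDeserializationSchema extends AbstractDeserializationSchema<ObjectNode> {

    private transient ObjectMapper mapper;

    @Override
    public ObjectNode deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        try {
            return mapper.readValue(message, ObjectNode.class);
        } catch (IOException e) {
            // Returning null tells the Flink Kafka consumer to skip this record.
            return null;
        }
    }
}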
