How can we define event time and processing time when defining a table from a data stream in Flink 1.16 and later - apache-flink

As of Flink 1.16, the StreamTableEnvironment.fromDataStream(DataStream dataStream, Expression... fields) method is deprecated.
In previous versions, event time and processing time attributes could be defined with the expression methods
.rowtime(),
.proctime(),
as described in https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/concepts/time_attributes/
What is the correct approach in the newer releases?
In previous versions we could define a table from a stream as
Table transactionTable = tableEnv.fromDataStream(transactionDataStream,
        $("field1"), $("field2"), $("field3"),
        $("transactionTime").rowtime(), $("ts").proctime());
but the method StreamTableEnvironment.fromDataStream(DataStream dataStream, Expression... fields) is deprecated after version 1.15.

You should now use <T> Table fromDataStream(DataStream<T> dataStream, Schema schema) instead, where the Schema will be something like this:
Schema.newBuilder()
.column("field1", DataTypes.INT())
.column("transactionTime", DataTypes.TIMESTAMP_LTZ(3))
.columnByExpression("ts", "PROCTIME()")
.watermark("transactionTime", "transactionTime - INTERVAL '10' SECOND")
.build();
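Applying the schema would look roughly like this (a sketch only; the types of field2 and field3 are assumptions, since they are not given in the question):
Table transactionTable = tableEnv.fromDataStream(
        transactionDataStream,
        Schema.newBuilder()
                // physical columns, matched by name against the stream's type
                .column("field1", DataTypes.INT())
                .column("field2", DataTypes.STRING())   // type assumed
                .column("field3", DataTypes.STRING())   // type assumed
                .column("transactionTime", DataTypes.TIMESTAMP_LTZ(3))
                // processing-time attribute
                .columnByExpression("ts", "PROCTIME()")
                // event-time attribute with a 10-second out-of-orderness bound
                .watermark("transactionTime", "transactionTime - INTERVAL '10' SECOND")
                .build());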
There's more information, and more examples, in the documentation.

Related

Is there any way we can parse a string expression in the Apache Flink Table API?

I am trying to perform aggregation using Flink Table API by accepting group by field and field aggregation expressions as string parameters from the user.
Input
GroupBy Field = department
Aggregation Field Expression = count(employeeId), max(salary)
Is there any way we can do it using the Flink Table API? I tried the following, but it didn't help. Does Flink have anything equivalent to the selectExpr function in Spark?
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.selectExpr.html
employeeTable
.groupBy($("department"))
.select(
$("department"),
$("count(employeeId)").as("numberOfEmployees"),
$("max(salary)").as("maxSalary")
)
It is throwing the following exception
Exception in thread "main" org.apache.flink.table.api.ValidationException: Cannot resolve field [count(employeeId)], input field list:[department].
No, I don't believe this will work. Flink's SQL planner wants to know what the query is doing at compile time.
What you can do is construct a SQL query and create a new job to run that query. The SQL gateway that's coming in Flink 1.16 (see FLIP-91) should make this easier.
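For example, a rough sketch (assuming the employee table is registered as a view, and that the group-by field and aggregation expressions arrive as plain strings from the user) of assembling and running such a query:
tableEnv.createTemporaryView("employee", employeeTable);
String groupByField = "department";                                                        // user input
String aggregations = "count(employeeId) AS numberOfEmployees, max(salary) AS maxSalary";  // user input
Table result = tableEnv.sqlQuery(
        "SELECT " + groupByField + ", " + aggregations +
        " FROM employee GROUP BY " + groupByField);
Note that concatenating user-supplied strings into SQL carries the usual injection risks, so the input should be validated.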
I think you have the wrong syntax.
.select(
$("department"),
$("count(employeeId)").as("numberOfEmployees"),
$("max(salary)").as("maxSalary")
)
count and max should be called like this:
$("employeeId").count().as("numberOfEmployees"),
$("salary").max().as("maxSalary")
You can check Built-in functions here

the policy of new watermark generation when defined in DDL

In the https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/streaming/time_attributes.html
The DDL for the event time attribute and watermark is:
CREATE TABLE user_actions (
user_name STRING,
data STRING,
user_action_time TIMESTAMP(3),
-- declare user_action_time as event time attribute and use 5 seconds delayed watermark strategy
WATERMARK FOR user_action_time AS user_action_time - INTERVAL '5' SECOND
) WITH (
...
);
I would like to ask about the policy for new watermark generation:
With a data stream, Flink provides the following two policies for watermark generation; what about in DDL?
periodically, like AssignerWithPeriodicWatermarks, that is, try to generate a new watermark periodically
punctuated, like AssignerWithPunctuatedWatermarks, that is, try to generate a new watermark when a new event comes
The watermark is periodically assigned. You can specify the interval via the configuration pipeline.auto-watermark-interval.
Also note that the watermark API in the DataStream API has changed, and the two classes you mention are deprecated by now.
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html#introduction-to-watermark-strategies
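As an illustration (a sketch; the 200 ms value is arbitrary, and env/tableEnv are assumed to already exist), the interval can be set like this:
// Table / SQL programs: how often watermarks declared in DDL are emitted
tableEnv.getConfig().getConfiguration()
        .setString("pipeline.auto-watermark-interval", "200 ms");
// Equivalent setting on the DataStream side (milliseconds)
env.getConfig().setAutoWatermarkInterval(200L);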

kafka flink timestamp Event time and watermark

I am reading the book Stream Processing with Apache Flink and it is stated that “As of version 0.10.0, Kafka supports message timestamps. When reading from Kafka version 0.10 or later, the consumer will automatically extract the message timestamp as an event-time timestamp if the application runs in event-time mode.”
So inside a processElement function the call context.timestamp() will by default return the kafka message timestamp?
Could you please provide a simple example of how to implement AssignerWithPeriodicWatermarks/AssignerWithPunctuatedWatermarks that extracts (and builds watermarks from) the consumed Kafka message timestamp.
If I am using TimeCharacteristic.ProcessingTime, would ctx.timestamp() return the processing time, and in that case would it be similar to context.timerService().currentProcessingTime()?
Thank you.
The Flink Kafka consumer takes care of this for you, and puts the timestamp where it needs to be. In Flink 1.11 you can simply rely on this, though you still need to take care of providing a WatermarkStrategy that specifies the out-of-orderness (or asserts that the timestamps are in order):
FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.assignTimestampsAndWatermarks(
    WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)));
In earlier versions of Flink you had to provide an implementation of a timestamp assigner, which would look like this:
public long extractTimestamp(Long element, long previousElementTimestamp) {
return previousElementTimestamp;
}
This version of the extractTimestamp method is passed the current value of the timestamp present in the StreamRecord as previousElementTimestamp, which in this case will be the timestamp put there by the Flink Kafka consumer.
Flink 1.11 docs
Flink 1.10 docs
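Since the question asked for an example, here is a minimal sketch of a pre-1.11 periodic assigner that reuses the Kafka timestamp and emits bounded-out-of-orderness watermarks (the class name and the 20-second bound are made up):
public static class KafkaTimestampAssigner implements AssignerWithPeriodicWatermarks<String> {
    private long maxTimestamp = 0L;

    @Override
    public long extractTimestamp(String element, long previousElementTimestamp) {
        // previousElementTimestamp is the timestamp the Flink Kafka consumer already attached
        maxTimestamp = Math.max(maxTimestamp, previousElementTimestamp);
        return previousElementTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // invoked periodically, controlled by the auto-watermark interval
        return new Watermark(maxTimestamp - 20_000L);
    }
}
It would be attached with myConsumer.assignTimestampsAndWatermarks(new KafkaTimestampAssigner()).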
As for what is returned by ctx.timestamp() when using TimeCharacteristic.ProcessingTime, this method returns NULL in that case. (Semantically, yes, it is as though the timestamp is the current processing time, but that's not how it's implemented.)
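To make that concrete, a small sketch of a process function (names assumed) that emits both values:
public static class TimestampDebug extends ProcessFunction<String, String> {
    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        Long eventTimestamp = ctx.timestamp();                              // null under ProcessingTime
        long processingTime = ctx.timerService().currentProcessingTime();  // wall-clock time
        out.collect(value + " eventTime=" + eventTimestamp + " processingTime=" + processingTime);
    }
}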

Flink+Kafka 0.10: How to create a Table with the Kafka message timestamp as field?

I would like to extract the timestamp of the messages that are produced by FlinkKafkaConsumer010 as values in the data stream.
I am aware of the AssignerWithPeriodicWatermarks class, but this seems to only extract the timestamp for the purposes of time aggregates via the DataStream API.
I would like to make that Kafka message timestamp available in a Table so later on, I can use SQL on it.
EDIT: Tried this:
val consumer = new FlinkKafkaConsumer010("test", new SimpleStringSchema, properties)
consumer.setStartFromEarliest()
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tenv = TableEnvironment.getTableEnvironment(env)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
class KafkaAssigner[T] extends AssignerWithPeriodicWatermarks[T] {
var maxTs = 0L
override def extractTimestamp(element: T, previousElementTimestamp: Long): Long = {
maxTs = Math.max(maxTs, previousElementTimestamp)
previousElementTimestamp
}
override def getCurrentWatermark: Watermark = new Watermark(maxTs - 1L)
}
val stream = env
.addSource(consumer)
.assignTimestampsAndWatermarks(new KafkaAssigner[String])
.flatMap(_.split("\\W+"))
val tbl = tenv.fromDataStream(stream, 'w, 'ts.rowtime)
It compiles, but throws:
Exception in thread "main" org.apache.flink.table.api.TableException: Field reference expression requested.
at org.apache.flink.table.api.TableEnvironment$$anonfun$1.apply(TableEnvironment.scala:630)
at org.apache.flink.table.api.TableEnvironment$$anonfun$1.apply(TableEnvironment.scala:624)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at org.apache.flink.table.api.TableEnvironment.getFieldInfo(TableEnvironment.scala:624)
at org.apache.flink.table.api.StreamTableEnvironment.registerDataStreamInternal(StreamTableEnvironment.scala:398)
at org.apache.flink.table.api.scala.StreamTableEnvironment.fromDataStream(StreamTableEnvironment.scala:85)
at the very last line of the above code.
EDIT2: Thanks to #fabian-hueske for pointing me to a workaround. Full code at https://github.com/andrey-savov/flink-kafka
Flink's Kafka 0.10 consumer automatically sets the timestamp of a Kafka message as the event-time timestamp of produced records if the time characteristic EventTime is configured (see docs).
After you have ingested the Kafka topic into a DataStream with timestamps (still not visible) and watermarks assigned, you can convert it into a Table with the StreamTableEnvironment.fromDataStream(stream, fieldExpr*) method. The fieldExpr* parameter is a list of expressions that describe the schema of the generated table. You can add a field that holds the record timestamp of the stream with an expression mytime.rowtime, where mytime is the name of the new field and rowtime indicates that the value is extracted from the record timestamp. Please check the docs for details.
NOTE: As #bfair pointed out, the conversion of a DataStream of an atomic type (such as DataStream[String]) fails with an exception in Flink 1.3.2 and earlier versions. The bug has been reported as FLINK-7939 and will be fixed in the next versions.
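For illustration, a rough Java sketch of the workaround idea (the Tuple1 wrapping is an assumption based on FLINK-7939; it avoids converting an atomic type directly):
// Wrap the atomic String stream in a Tuple1 so it is no longer an atomic type
DataStream<Tuple1<String>> words = stream.map(new MapFunction<String, Tuple1<String>>() {
    @Override
    public Tuple1<String> map(String word) {
        return Tuple1.of(word);
    }
});
// "w" renames the single physical field; "ts.rowtime" exposes the record timestamp
// (set by the Kafka 0.10 consumer) as an event-time attribute
Table tbl = tenv.fromDataStream(words, "w, ts.rowtime");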

HiveQL to HBase

I am using Hive 0.14 and HBase 0.98.8
I would like to use HiveQL for accessing a HBase "table".
I created a table with a complex composite rowkey:
CREATE EXTERNAL TABLE db.hive_hbase (rowkey struct<p1:string, p2:string, p3:string>, column1 string, column2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ';'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:c1,cf:c2")
TBLPROPERTIES("hbase.table.name"="hbase_table");
The table is created successfully, but the HiveQL query takes forever:
SELECT * from db.hive_hbase WHERE rowkey.p1 = 'xyz';
Queries that do not use the rowkey are fine, and filters in the HBase shell also work.
I don't find anything in the logs, but I assume that there could be an issue with complex composite keys and performance.
Did anybody face the same issue? Hints to solve it? Other ideas, what I could try?
Thank you
Update 16.07.15:
I changed the log4j properties to 'DEBUG' and found some interesting information:
It says:
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(823)) - Pushdown Predicates of FIL For Alias : hive_hbase
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(826)) - (rowkey.p1 = 'xyz')
But some lines later:
2015-07-15 15:56:41,430 DEBUG ppd.OpProcFactory (OpProcFactory.java:pushFilterToStorageHandler(1051)) - No pushdown possible for predicate: (rowkey.p1 = 'xyz')
So my guess is: HiveQL over HBase does not do any predicate pushdown to HBase but rather starts a MapReduce job.
Could there be a bug with the predicate pushdown?
I tried a similar situation using Hive 0.13 and it works fine; I got the result. What version of Hive are you working on?
