Flink SQL: How can I use a Long type column as rowtime? - apache-flink

Flink1.9.1
I read a CSV file and want to use a Long type column in a TUMBLE window.
I use a UDF to convert the Long type to the Timestamp type, but it doesn't work.
Error message: Window can only be defined over a time attribute column.
I tried to debug: the check below expects a TimeIndicatorRelDataType rather than a plain Timestamp, and I don't know how to do that conversion, or why it is required.
def isTimeIndicatorType(relDataType: RelDataType): Boolean = relDataType match {
case ti: TimeIndicatorRelDataType => true
case _ => false
}
CODE
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
// read csv
URL fileUrl = HotItemsSql.class.getClassLoader().getResource("UserBehavior-less.csv");
CsvTableSource csvTableSource = CsvTableSource.builder().path(fileUrl.getPath())
.field("userId", BasicTypeInfo.LONG_TYPE_INFO)
.field("itemId", BasicTypeInfo.LONG_TYPE_INFO)
.field("categoryId", BasicTypeInfo.LONG_TYPE_INFO)
.field("behavior", BasicTypeInfo.LONG_TYPE_INFO)
.field("optime", BasicTypeInfo.LONG_TYPE_INFO)
.build();
// trans to stream
DataStream<Row> csvDataStream=csvTableSource.getDataStream(env).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Row>() {
@Override
public long extractAscendingTimestamp(Row element) {
return Timestamp.valueOf(element.getField(5).toString()).getTime();
}
}).broadcast();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.registerDataStream("T_UserBehavior",csvDataStream,"userId,itemId,categoryId,behavior,optime");
tableEnv.registerFunction("Long2DateTime",new DateTransFunction());
Table result = tableEnv.sqlQuery("select userId," +
"TUMBLE_START(Long2DateTime(optime), INTERVAL '10' SECOND) as window_start," +
"TUMBLE_END(Long2DateTime(optime), INTERVAL '10' SECOND) as window_end " +
"from T_UserBehavior " +
"group by TUMBLE(Long2DateTime(optime),INTERVAL '10' SECOND),userId");
tableEnv.toRetractStream(result, Row.class).print();
UDF
import java.sql.Timestamp;
import org.apache.flink.table.functions.ScalarFunction;

public class DateTransFunction extends ScalarFunction {
public Timestamp eval(Long longTime) {
try {
Timestamp t = new Timestamp(longTime);
return t;
} catch (Exception e) {
return null;
}
}
}
error stack
Exception in thread "main" org.apache.flink.table.api.ValidationException: Window can only be defined over a time attribute column.
at org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.getOperandAsTimeIndicator$1(DataStreamLogicalWindowAggregateRule.scala:85)
at org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.translateWindowExpression(DataStreamLogicalWindowAggregateRule.scala:90)
at org.apache.flink.table.plan.rules.common.LogicalWindowAggregateRule.onMatch(LogicalWindowAggregateRule.scala:68)
at org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:319)
at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:560)
at org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:419)
at org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:256)
at org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:127)
at org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:215)
at org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:202)
at org.apache.flink.table.plan.Optimizer.runHepPlanner(Optimizer.scala:228)
at org.apache.flink.table.plan.Optimizer.runHepPlannerSequentially(Optimizer.scala:194)
at org.apache.flink.table.plan.Optimizer.optimizeNormalizeLogicalPlan(Optimizer.scala:150)
at org.apache.flink.table.plan.StreamOptimizer.optimize(StreamOptimizer.scala:65)
at org.apache.flink.table.planner.StreamPlanner.translateToType(StreamPlanner.scala:410)
at org.apache.flink.table.planner.StreamPlanner.org$apache$flink$table$planner$StreamPlanner$$translate(StreamPlanner.scala:182)

Since you already managed to assign a timestamp in DataStream API, you should be able to call:
tableEnv.registerDataStream(
"T_UserBehavior",
csvDataStream,
"userId, itemId, categoryId, behavior, rt.rowtime");
The .rowtime suffix instructs the API to create a column from the timestamp stored in every stream record coming from the DataStream API.
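Putting it together, a sketch of the adjusted registration and query; the column name rt is just an example:
// "rt" becomes an event-time attribute derived from the DataStream record timestamps,
// so it can be used directly in TUMBLE.
tableEnv.registerDataStream(
        "T_UserBehavior",
        csvDataStream,
        "userId, itemId, categoryId, behavior, rt.rowtime");

Table result = tableEnv.sqlQuery(
        "SELECT userId, " +
        "  TUMBLE_START(rt, INTERVAL '10' SECOND) AS window_start, " +
        "  TUMBLE_END(rt, INTERVAL '10' SECOND) AS window_end " +
        "FROM T_UserBehavior " +
        "GROUP BY TUMBLE(rt, INTERVAL '10' SECOND), userId");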
The community is currently working on making programs like yours easier to write. In Flink 1.10 you should be able to define your CSV table with a rowtime attribute directly in SQL DDL.
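For reference, a rough sketch of what such a DDL could look like in Flink 1.10 with the Blink planner, assuming optime holds epoch seconds (use optime / 1000 inside FROM_UNIXTIME if it is milliseconds); the connector and format properties are placeholders and may need adjusting for your setup:
tableEnv.sqlUpdate(
        "CREATE TABLE T_UserBehavior (" +
        "  userId BIGINT," +
        "  itemId BIGINT," +
        "  categoryId BIGINT," +
        "  behavior BIGINT," +
        "  optime BIGINT," +
        // computed column turning the Long epoch value into a TIMESTAMP(3)
        "  rt AS TO_TIMESTAMP(FROM_UNIXTIME(optime))," +
        // declare rt as the rowtime attribute with a 5 second watermark delay
        "  WATERMARK FOR rt AS rt - INTERVAL '5' SECOND" +
        ") WITH (" +
        "  'connector.type' = 'filesystem'," +
        "  'connector.path' = 'file:///path/to/UserBehavior-less.csv'," +
        "  'format.type' = 'csv'" +
        ")");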

Related

Flink JDBC Sink part 2

I posted a question a few days back: Flink JDBC sink.
Now I am trying to use the sink provided by Flink.
I have written the code and it runs, but nothing gets saved in the DB and there are no exceptions. With my previous sink the job did not finish (which is expected, since it is a streaming app), but with the following code I get no errors and nothing is saved to the DB.
public class CompetitorPipeline implements Pipeline {
private final StreamExecutionEnvironment streamEnv;
private final ParameterTool parameter;
private static final Logger LOG = LoggerFactory.getLogger(CompetitorPipeline.class);
public CompetitorPipeline(StreamExecutionEnvironment streamEnv, ParameterTool parameter) {
this.streamEnv = streamEnv;
this.parameter = parameter;
}
@Override
public KeyedStream<CompetitorConfig, String> start(ParameterTool parameter) throws Exception {
CompetitorConfigChanges competitorConfigChanges = new CompetitorConfigChanges();
KeyedStream<CompetitorConfig, String> competitorChangesStream = competitorConfigChanges.run(streamEnv, parameter);
// Add to JDBC sink
competitorChangesStream.addSink(JdbcSink.sink(
"insert into competitor_config_universe(marketplace_id,merchant_id, competitor_name, comp_gl_product_group_desc," +
"category_code, competitor_type, namespace, qualifier, matching_type," +
"zip_region, zip_code, competitor_state, version_time, compConfigTombstoned, last_updated) values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
(ps, t) -> {
ps.setInt(1, t.getMarketplaceId());
ps.setLong(2, t.getMerchantId());
ps.setString(3, t.getCompetitorName());
ps.setString(4, t.getCompGlProductGroupDesc());
ps.setString(5, t.getCategoryCode());
ps.setString(6, t.getCompetitorType());
ps.setString(7, t.getNamespace());
ps.setString(8, t.getQualifier());
ps.setString(9, t.getMatchingType());
ps.setString(10, t.getZipRegion());
ps.setString(11, t.getZipCode());
ps.setString(12, t.getCompetitorState());
ps.setTimestamp(13, Timestamp.valueOf(t.getVersionTime()));
ps.setBoolean(14, t.isCompConfigTombstoned());
ps.setTimestamp(15, new Timestamp(System.currentTimeMillis()));
System.out.println("sql"+ps);
},
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database")
.withDriverName("com.mysql.cj.jdbc.Driver")
.withUsername("xyz")
.withPassword("xyz#")
.build()));
return competitorChangesStream;
}
}
You need to enable auto-commit mode for the JDBC sink.
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database;autocommit=true")
It looks like SimpleBatchStatementExecutor only works in auto-commit mode. If you need to commit and roll back batches, then you have to write your own JdbcBatchStatementExecutor (a sketch follows below).
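For illustration only, a minimal sketch of such an executor, assuming the JdbcBatchStatementExecutor interface from flink-connector-jdbc (prepareStatements/addToBatch/executeBatch/closeStatements, which lives in an internal package); wiring it into a sink requires the lower-level JDBC output format rather than JdbcSink.sink(), and the StatementFiller helper below is hypothetical:
import org.apache.flink.connector.jdbc.internal.executor.JdbcBatchStatementExecutor;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayDeque;
import java.util.Deque;

public class CommittingStatementExecutor<T> implements JdbcBatchStatementExecutor<T> {

    /** Hypothetical helper for filling the prepared statement from a record. */
    public interface StatementFiller<T> {
        void fill(PreparedStatement ps, T record) throws SQLException;
    }

    private final String sql;
    private final StatementFiller<T> filler;
    private final Deque<T> buffer = new ArrayDeque<>();

    private transient Connection connection;
    private transient PreparedStatement statement;

    public CommittingStatementExecutor(String sql, StatementFiller<T> filler) {
        this.sql = sql;
        this.filler = filler;
    }

    @Override
    public void prepareStatements(Connection connection) throws SQLException {
        this.connection = connection;
        this.connection.setAutoCommit(false);  // we commit explicitly per batch
        this.statement = connection.prepareStatement(sql);
    }

    @Override
    public void addToBatch(T record) {
        buffer.add(record);
    }

    @Override
    public void executeBatch() throws SQLException {
        if (buffer.isEmpty()) {
            return;
        }
        try {
            for (T record : buffer) {
                filler.fill(statement, record);
                statement.addBatch();
            }
            statement.executeBatch();
            connection.commit();               // commit the whole batch
            buffer.clear();
        } catch (SQLException e) {
            connection.rollback();             // roll back the batch on failure
            throw e;
        }
    }

    @Override
    public void closeStatements() throws SQLException {
        if (statement != null) {
            statement.close();
        }
    }
}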
Have you tried including the JdbcExecutionOptions?
dataStream.addSink(JdbcSink.sink(
sql_statement,
(statement, value) -> {
/* Prepared Statement */
},
JdbcExecutionOptions.builder()
.withBatchSize(5000)
.withBatchIntervalMs(200)
.withMaxRetries(2)
.build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database")
.withDriverName("com.mysql.cj.jdbc.Driver")
.withUsername("xyz")
.withPassword("xyz#")
.build()));

Apache Flink EventTime processing not working

I am trying to perform a stream-stream join using a Flink v1.11 app on KDA. The join works with ProcessingTime, but with EventTime I don't see any output records from Flink.
Here is my code with EventTime processing, which is not working:
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Trade> input1 = createSourceFromInputStreamName1(env)
.assignTimestampsAndWatermarks(
WatermarkStrategy.<Trade>forMonotonousTimestamps()
.withTimestampAssigner(((event, l) -> event.getEventTime()))
);
DataStream<Company> input2 = createSourceFromInputStreamName2(env)
.assignTimestampsAndWatermarks(
WatermarkStrategy.<Company>forMonotonousTimestamps()
.withTimestampAssigner(((event, l) -> event.getEventTime()))
);
DataStream<String> joinedStream = input1.join(input2)
.where(new TradeKeySelector())
.equalTo(new CompanyKeySelector())
.window(TumblingEventTimeWindows.of(Time.seconds(30)))
.apply(new JoinFunction<Trade, Company, String>() {
@Override
public String join(Trade t, Company c) {
return t.getEventTime() + ", " + t.getTicker() + ", " + c.getName() + ", " + t.getPrice();
}
});
joinedStream.addSink(createS3SinkFromStaticConfig());
env.execute("Flink S3 Streaming Sink Job");
}
I got a similar join working with ProcessingTime
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
DataStream<Trade> input1 = createSourceFromInputStreamName1(env);
DataStream<Company> input2 = createSourceFromInputStreamName2(env);
DataStream<String> joinedStream = input1.join(input2)
.where(new TradeKeySelector())
.equalTo(new CompanyKeySelector())
.window(TumblingProcessingTimeWindows.of(Time.milliseconds(10000)))
.apply (new JoinFunction<Trade, Company, String> (){
@Override
public String join(Trade t, Company c) {
return t.getEventTime() + ", " + t.getTicker() + ", " + c.getName() + ", " + t.getPrice();
}
});
joinedStream.addSink(createS3SinkFromStaticConfig());
env.execute("Flink S3 Streaming Sink Job");
}
Sample records from two streams which I am trying to join:
{'eventTime': 1611773705, 'ticker': 'TBV', 'price': 71.5}
{'eventTime': 1611773705, 'ticker': 'TBV', 'name': 'The Bavaria'}
I don't see anything obviously wrong, but any of the following could cause this job to not produce any output:
A problem with watermarking. For example, if one of the streams becomes idle, the watermarks will cease to advance (see the sketch after this list for one way to guard against idle sources). Or if there are no events after a window, the watermark will not advance far enough to close that window. Or if the timestamps aren't actually in ascending order (with the forMonotonousTimestamps strategy, the events should be in order by timestamp), the pipeline could be silently dropping all of the out-of-order events.
The StreamingFileSink only finalizes its output during checkpointing, and does not finalize whatever files are pending if and when the job is stopped.
A windowed join behaves like an inner join, and requires at least one event from each input stream in order to produce any results for a given window interval. From the example you shared, it looks like this is not the issue.
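If idle sources turn out to be the problem, one option is to declare the streams as potentially idle so that an inactive source does not hold back the watermark. A sketch against the Flink 1.11 WatermarkStrategy API, reusing the getters from your code (the 30-second idleness threshold is just an example):
// Mark the source as idle after 30 seconds without events so that the
// watermark of the downstream join can still advance.
DataStream<Trade> input1 = createSourceFromInputStreamName1(env)
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Trade>forMonotonousTimestamps()
                        .withTimestampAssigner((event, l) -> event.getEventTime())
                        .withIdleness(Duration.ofSeconds(30)));  // requires java.time.Duration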
Update:
Given that what you (appear to) want to do is to join each Trade with the latest Company record available at the time of the Trade, a lookup join or a temporal table join seems like a good fit (a sketch follows the links below).
Here are a couple of examples:
https://github.com/ververica/flink-sql-cookbook/blob/master/joins/04/04_lookup_joins.md
https://github.com/ververica/flink-sql-cookbook/blob/master/joins/03/03_kafka_join.md
Some documentation:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html#event-time-temporal-join
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/versioned_tables.html
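Following up on the temporal table join suggestion, here is a rough sketch of an event-time temporal table function join, assuming Flink 1.11 with the Blink planner, that the Trade/Company POJO field names match the getters in your code, and the usual Table API imports (including import static org.apache.flink.table.api.Expressions.$;):
// Enrich each Trade with the Company version valid at the trade's event time.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Register both streams as tables with event-time (rowtime) attributes.
tableEnv.createTemporaryView("Trades",
        tableEnv.fromDataStream(input1, $("ticker"), $("price"), $("eventTime").rowtime()));

Table companies =
        tableEnv.fromDataStream(input2, $("ticker"), $("name"), $("eventTime").rowtime());

// Temporal table function: versioned by eventTime, keyed by ticker.
tableEnv.registerFunction(
        "CompanyHistory",
        companies.createTemporalTableFunction($("eventTime"), $("ticker")));

Table joined = tableEnv.sqlQuery(
        "SELECT t.eventTime, t.ticker, c.name, t.price " +
        "FROM Trades AS t, " +
        "     LATERAL TABLE (CompanyHistory(t.eventTime)) AS c " +
        "WHERE t.ticker = c.ticker");
Unlike the windowed join, this does not require a Company event in every window interval; each trade simply picks up the most recent company record at or before its timestamp.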

Flink DataStream[String] kafkaconsumer convert to Avro for Sink

Flink streaming: I have a DataStream[String] from a Kafka consumer, which contains JSON:
stream = env
.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
I have to sink this stream using StreamingFileSink, which needs a DataStream[GenericRecord]:
val schema: Schema = ...
val input: DataStream[GenericRecord] = ...
val sink: StreamingFileSink[GenericRecord] = StreamingFileSink
.forBulkFormat(outputBasePath, AvroWriters.forGenericRecord(schema))
.build()
input.addSink(sink)
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/streamfile_sink.html
Question: How do I convert DataStream[String] to DataStream[GenericRecord] before sinking so that I can write Avro files?
Exception while converting the String stream to a GenericRecord stream:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: Task not serializable
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:408)
at org.apache.flink.api.scala.ClosureCleaner$.org$apache$flink$api$scala$ClosureCleaner$$clean(ClosureCleaner.scala:400)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:168)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.scalaClean(StreamExecutionEnvironment.scala:791)
at org.apache.flink.streaming.api.scala.DataStream.clean(DataStream.scala:1168)
at org.apache.flink.streaming.api.scala.DataStream.map(DataStream.scala:617)
at com.att.vdcs.StreamingJobKafkaFlink$.main(StreamingJobKafkaFlink.scala:128)
at com.att.vdcs.StreamingJobKafkaFlink.main(StreamingJobKafkaFlink.scala)
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:586)
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:406)
... 7 more
After initializing the schema in the mapper, I am getting a cast exception:
org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.util.Map
I got past the cast exception by converting the Scala map to a Java map:
record.put(0,scala.collection.JavaConverters.mapAsJavaMapConverter(msg._1).asJava)
Now streaming is working well, except that extra escape characters are added:
,"body":"\"{\\\"hdr\\\":{\\\"mes
There are extra escape characters (\).
It should be:
,"body":"\"{\"hdr\":{\"mes
The extra escapes were removed after changing toString to getAsString.
Now it is working as expected.
Next I need to try Snappy compression of the stream.
You need to transform your stream of Strings into a stream of GenericRecords, for example using a .map() function.
Example:
DataStream<String> strings = env.addSource( ... );
DataStream<GenericRecord> records = strings.map(inputStr -> {
GenericData.Record rec = new GenericData.Record(schema);
rec.put(0, inputStr);
return rec;
});
Please note that using GenericRecord can lead to poor performance, because the schema needs to be serialized with each record over and over again.
It is better to generate an Avro POJO (a SpecificRecord), as it won't need to ship the schema.
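For example, if you generate a specific record class with the Avro Maven plugin, the schema is compiled into the class and nothing has to be shipped per record. A sketch, where the UserRecord class and its setter are hypothetical placeholders for your generated type:
// "UserRecord" stands in for an Avro-generated SpecificRecord class.
DataStream<UserRecord> records = strings.map(json -> {
    UserRecord rec = new UserRecord();
    rec.setBody(json);              // hypothetical field; map your JSON fields as needed
    return rec;
});

StreamingFileSink<UserRecord> sink = StreamingFileSink
        .forBulkFormat(outputBasePath, AvroWriters.forSpecificRecord(UserRecord.class))
        .build();

records.addSink(sink);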
In Java, you can use a RichMapFunction to convert the input DataStream to a DataStream<GenericRecord>, adding a transient Schema field (initialized in open()) to build the GenericRecord instances. I don't know how to do this in Scala; the code below is just for reference.
DataStream<GenericRecord> records = maps.map(new RichMapFunction<Map<String, Object>, GenericRecord>() {
private transient DatumWriter<IndexedRecord> datumWriter;
/**
* Output stream to serialize records into byte array.
*/
private transient ByteArrayOutputStream arrayOutputStream;
/**
* Low-level class for serialization of Avro values.
*/
private transient Encoder encoder;
/**
* Avro serialization schema.
*/
private transient Schema schema;
@Override
public GenericRecord map(Map<String, Object> stringObjectMap) throws Exception {
GenericRecord record = new GenericData.Record(schema);
stringObjectMap.entrySet().forEach(entry->{record.put(entry.getKey(), entry.getValue());});
return record;
}
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
// parse the schema first, because the datum writer below needs it
try {
this.schema = new Schema.Parser().parse(avroSchemaString);
} catch (SchemaParseException e) {
throw new IllegalArgumentException("Could not parse Avro schema string.", e);
}
this.arrayOutputStream = new ByteArrayOutputStream();
this.encoder = EncoderFactory.get().binaryEncoder(arrayOutputStream, null);
this.datumWriter = new GenericDatumWriter<>(schema);
}
});
final StreamingFileSink<GenericRecord> sink = StreamingFileSink
.forBulkFormat(new Path("D:\\test"), AvroWriters.forGenericRecord(mongoSchema))
.build();
records.addSink(sink);

how to use the TUMBLE(time_attr, interval) window function in Flink SQL

I want to use the TUMBLE(time_attr, interval) window function in my Flink SQL, but I don't know how to set the 'time_attr' based on my data.
Below is one line from my Kafka source. It is in JSON format, and the body field contains the user logs:
{
body: [
"user1,url1,2018-10-23 00:00:00;user2,url2,2018-10-23 00:01:00;user3,url3,2018-10-23 00:02:00"
]}
I use LATERAL TABLE and a user-defined TableFunction to flat-map the source into a new table log, and I want to group by the time and username. Here is my code:
public class BodySplitFun extends TableFunction<Tuple3<String, String, Long>> {
private SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
public void eval(Row bodyRow) {
String body = bodyRow.getField(0).toString();
String[] lines = body.split(";");
for (String line : lines) {
String user = line.split(",")[0];
String url = line.split(",")[1];
String sTime = line.split(",")[2];
try {
collect(new Tuple3<>(user, url, sdf.parse(sTime).getTime()));
} catch (java.text.ParseException e) {
// ignore lines whose timestamp cannot be parsed
}
}
}
}
tblEnv.registerFunction("bodySplit", new BodySplitFun());
tblEnv.sqlUpdate(
"select
count(username)
from
(
SELECT
username,
url,
sTime
FROM
mySource LEFT JOIN LATERAL TABLE(bodySplit(body)) as T(username, url, sTime) ON TRUE
)
log
group by
TUMBLE(log.sTime, INTERVAL '1' MINUTE), log.username");
When I run my program, I get this error message:
Caused by: org.apache.calcite.sql.validate.SqlValidatorException: Cannot apply 'TUMBLE' to arguments of type 'TUMBLE(<BIGINT>, <INTERVAL DAY>)'. Supported form(s): 'TUMBLE(<DATETIME>, <DATETIME_INTERVAL>)'
'TUMBLE(<DATETIME>, <DATETIME_INTERVAL>, <TIME>)'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:463)
at org.apache.calcite.runtime.Resources$ExInst.ex(Resources.java:572)
... 49 more
How can I use the sTime field of table log for group by operation?

Apache Flink JDBC InputFormat throwing java.net.SocketException: Socket closed

I am querying an Oracle database using the Flink DataSet API. For this I have customised the Flink JDBCInputFormat to return a java.sql.ResultSet, as I need to perform further operations on the result set using Flink operators.
public static void main(String[] args) throws Exception {
ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();
environment.setParallelism(1);
@SuppressWarnings("unchecked")
DataSource<ResultSet> source
= environment.createInput(JDBCInputFormat.buildJDBCInputFormat()
.setUsername("username")
.setPassword("password")
.setDrivername("driver_name")
.setDBUrl("jdbcUrl")
.setQuery("query")
.finish(),
new GenericTypeInfo<ResultSet>(ResultSet.class)
);
source.print();
environment.execute();
}
Following is the customised JDBCInputFormat:
public class JDBCInputFormat extends RichInputFormat<ResultSet, InputSplit> implements ResultTypeQueryable {
@Override
public void open(InputSplit inputSplit) throws IOException {
Class.forName(drivername);
dbConn = DriverManager.getConnection(dbURL, username, password);
statement = dbConn.prepareStatement(queryTemplate, resultSetType, resultSetConcurrency);
resultSet = statement.executeQuery();
}
@Override
public void close() throws IOException {
if(statement != null) {
statement.close();
}
if(resultSet != null)
resultSet.close();
if(dbConn != null) {
dbConn.close();
}
}
@Override
public boolean reachedEnd() throws IOException {
isLastRecord = resultSet.isLast();
return isLastRecord;
}
@Override
public ResultSet nextRecord(ResultSet row) throws IOException{
if(!isLastRecord){
resultSet.next();
}
return resultSet;
}
}
This works with the query below, which limits the rows fetched:
SELECT a,b,c from xyz where rownum <= 10;
But when I try to fetch all the rows (approximately 1 million records), I get the exception below after fetching a random number of rows:
java.sql.SQLRecoverableException: Io exception: Socket closed
at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:101)
at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:133)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:199)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:263)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:521)
at oracle.jdbc.driver.T4CPreparedStatement.fetch(T4CPreparedStatement.java:1024)
at oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:314)
at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:228)
at oracle.jdbc.driver.ScrollableResultSet.cacheRowAt(ScrollableResultSet.java:1839)
at oracle.jdbc.driver.ScrollableResultSet.isValidRow(ScrollableResultSet.java:1823)
at oracle.jdbc.driver.ScrollableResultSet.isLast(ScrollableResultSet.java:349)
at JDBCInputFormat.reachedEnd(JDBCInputFormat.java:98)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:173)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketOutputStream.socketWrite0(Native Method)
So for my case, how can I solve this issue?
I don't think it is possible to ship a ResultSet like a regular record. It is a stateful object that internally maintains a connection to the database server. Using a ResultSet as a record that is transferred between Flink operators means that it must be serialized, shipped over the network to another machine, deserialized, and handed to a different thread in a different JVM process. That does not work.
Depending on the connection, a ResultSet might also stay on the same machine in the same thread, which might be the case that worked for you. If you want to query a database from within an operator, you could implement the function as a RichMapPartitionFunction. Otherwise, I'd read the ResultSet in the data source and forward the resulting rows.
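One way to do that without a custom input format is the stock org.apache.flink.api.java.io.jdbc.JDBCInputFormat, which emits each database row as a serializable Row when given a RowTypeInfo. A sketch, with placeholder column types for a, b, c:
// The built-in JDBCInputFormat converts each ResultSet row into a Flink Row,
// so downstream operators receive plain, serializable records.
RowTypeInfo rowTypeInfo = new RowTypeInfo(
        BasicTypeInfo.STRING_TYPE_INFO,   // a  (placeholder types -- match your table)
        BasicTypeInfo.LONG_TYPE_INFO,     // b
        BasicTypeInfo.DOUBLE_TYPE_INFO);  // c

DataSource<Row> source = environment.createInput(
        JDBCInputFormat.buildJDBCInputFormat()
                .setDrivername("driver_name")
                .setDBUrl("jdbcUrl")
                .setUsername("username")
                .setPassword("password")
                .setQuery("SELECT a, b, c FROM xyz")
                .setRowTypeInfo(rowTypeInfo)
                .finish());

source.print();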
