Flink - how to process logic on a SQL query result

My requirement is to process or build some logic around the result of a SQL query in Flink. For simplicity, let's say I have two SQL queries running over different window sizes on one event stream. My questions are:
a) How will I know which query a given result belongs to?
b) How will I know how many rows are in the result of an executed query? I need this info because I have to build a notification message with the list of events that are part of the query result.
DataStream<Event> ds = ...

String query = "select id, key" +
        " from eventTable GROUP BY TUMBLE(rowTime, INTERVAL '10' SECOND), id, key ";
String query1 = "select id, key" +
        " from eventTable GROUP BY TUMBLE(rowTime, INTERVAL '1' DAY), id, key ";

List<String> list = new ArrayList<>();
list.add(query);
list.add(query1);

tabEnv.createTemporaryView("eventTable", ds, $("id"), $("timeLong"), $("key"), $("rowTime").rowtime());

for (int i = 0; i < list.size(); i++) {
    Table result = tabEnv.sqlQuery(list.get(i));
    DataStream<Tuple2<Boolean, Row>> dsRow = tabEnv.toRetractStream(result, Row.class);
    dsRow.process(new ProcessFunction<Tuple2<Boolean, Row>, Object>() {
        List<Row> listRow = new ArrayList<>();

        @Override
        public void processElement(Tuple2<Boolean, Row> booleanRowTuple2, Context context, Collector<Object> collector) throws Exception {
            listRow.add(booleanRowTuple2.f1);
        }
    });
}
I appreciate your help. Thanks, Ashutosh

To sort out which results are from which query, you could include an identifier for each query in the queries themselves, e.g.,
SELECT '10sec', id, key FROM eventTable GROUP BY TUMBLE(rowTime, INTERVAL '10' SECOND), id, key
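and similarly for the one-day query:
SELECT '1day', id, key FROM eventTable GROUP BY TUMBLE(rowTime, INTERVAL '1' DAY), id, key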
Determining the number of rows in the result table is trickier. One issue is that there is no final answer to the number of results from a streaming query. But where you are processing the results, it seems like you could count the number of rows.
Or, and I haven't tried this, but maybe you could use something like row_number() over(order by tumble_rowtime(rowTime, interval '10' second)) to annotate each row of the result with a counter.
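Spelled out (equally untested; in particular, whether the planner accepts ROW_NUMBER here outside of a Top-N pattern is exactly what I haven't verified), that idea might look something like:
SELECT id, key,
       ROW_NUMBER() OVER (ORDER BY TUMBLE_ROWTIME(rowTime, INTERVAL '10' SECOND)) AS row_num
FROM eventTable
GROUP BY TUMBLE(rowTime, INTERVAL '10' SECOND), id, key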

Related

Apache Flink issue with JOINs on Kinesis Streams: "Rowtime attributes must not be in the input rows of a regular join"

I am attempting a simple exercise. I have two Kinesis data streams:
order-stream
shipment-stream
SQL 1: Orders
%flink.ssql
CREATE TABLE orders (
    orderid VARCHAR(6),
    orders VARCHAR,
    ts TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'order-stream',
    'aws.region' = 'us-east-1',
    'scan.stream.initpos' = 'TRIM_HORIZON',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);
SQL 2: Shipment
CREATE TABLE shipment (
    orderid VARCHAR(6),
    shipments VARCHAR(6),
    ts TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'shipment-stream',
    'aws.region' = 'us-east-1',
    'scan.stream.initpos' = 'TRIM_HORIZON',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);
Generating Fake Data into Kinesis Via Python
try:
    import datetime
    import json
    import random
    import boto3
    import os
    import uuid
    import time
    from dotenv import load_dotenv
    load_dotenv(".env")
except Exception as e:
    pass

STREAM_NAME_Order = "order-stream"
STREAM_NAME_Shipments = "shipment-stream"

def send_data(kinesis_client):
    order_items_number = random.randrange(1, 10000)
    order_items = {
        "orderid": order_items_number,
        "orders": "1",
        'ts': datetime.datetime.now().isoformat()
    }
    shipping_data = {
        "orderid": order_items_number,
        "shipments": random.randrange(1, 10000),
        'ts': datetime.datetime.now().isoformat()
    }
    partition_key = uuid.uuid4().__str__()
    res = kinesis_client.put_record(
        StreamName=STREAM_NAME_Order,
        Data=json.dumps(order_items),
        PartitionKey=partition_key)
    print(res)
    time.sleep(2)
    res = kinesis_client.put_record(
        StreamName=STREAM_NAME_Shipments,
        Data=json.dumps(shipping_data),
        PartitionKey=partition_key)
    print(res)

if __name__ == '__main__':
    kinesis_client = boto3.client(
        'kinesis',
        aws_access_key_id=os.getenv("DEV_ACCESS_KEY"),
        aws_secret_access_key=os.getenv("DEV_SECRET_KEY"),
        region_name="us-east-1",
    )
    for i in range(1, 10):
        send_data(kinesis_client)
%flink.ssql(type=update)
SELECT DISTINCT oo.orderid , TUMBLE_START(oo.ts, INTERVAL '10' MINUTE) as event_time
FROM orders as oo
GROUP BY orderid , TUMBLE(oo.ts, INTERVAL '10' MINUTE);
The issue is with joining:
%flink.ssql(type=update)
SELECT DISTINCT oo.orderid , TUMBLE_START(oo.ts, INTERVAL '10' MINUTE) as event_time , ss.shipments
FROM orders as oo
JOIN shipment AS ss ON oo.orderid = ss.orderid
GROUP BY oo.orderid , TUMBLE(oo.ts, INTERVAL '10' MINUTE) , ss.shipments
Error Messages
TableException: Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.
java.io.IOException: Fail to run stream sql job
at org.apache.zeppelin.flink.sql.AbstractStreamSqlJob.run(AbstractStreamSqlJob.java:172)
at org.apache.zeppelin.flink.sql.AbstractStreamSqlJob.run(AbstractStreamSqlJob.java:105)
at org.apache.zeppelin.flink.FlinkStreamSqlInterpreter.callInnerSelect(FlinkStreamSqlInterpreter.java:89)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.callSelect(FlinkSqlInterrpeter.java:503)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.callCommand(FlinkSqlInterrpeter.java:266)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.runSqlList(FlinkSqlInterrpeter.java:160)
at org.apache.zeppelin.flink.FlinkSqlInterrpeter.internalInterpret(FlinkSqlInterrpeter.java:112)
at org.apache.zeppelin.interpreter.AbstractInterpreter.interpret(AbstractInterpreter.java:47)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:110)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:852)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:744)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.ParallelScheduler.lambda$runJobInScheduler$0(ParallelScheduler.java:46)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.flink.table.api.TableException: Cannot generate a valid execution plan for the given query:
I also tried:
%flink.ssql(type=update)
SELECT DISTINCT oo.orderid ,
TUMBLE_START( oo.ts, INTERVAL '1' MINUTE),
ss.shipments
FROM orders as oo
JOIN shipment AS ss ON oo.orderid = ss.orderid
GROUP BY oo.orderid ,
TUMBLE(CAST(oo.ts AS TIME) ,INTERVAL '1' MINUTE) ,
ss.shipments
Error message:
SQL validation failed. From line 2, column 17 to line 2, column 57: Call to auxiliary group function 'TUMBLE_START' must have matching call to group function '$TUMBLE' in GROUP BY clause
I am not sure what exactly needs to be done here; any help would be great. Looking forward to hearing back from the experts.
TableException: Rowtime attributes must not be in the input rows of a regular join.
As a workaround you can cast the time attributes of input tables to TIMESTAMP before.
It is not possible for a regular join to have time attributes in its results, because the time attribute could not be well-defined. This is because the rows in a dynamic table must be at least roughly ordered by the time attribute, and there's no way to guarantee this for the result of a regular join (as opposed to an interval join, temporal join, or lookup join).
In versions of Flink before 1.14, the implementation dealt with this by not allowing regular joins to have time attributes in the input tables. While this avoided the problem, it was overly restrictive.
In your case I suggest you rewrite the join as an interval join, so that the output of the join will have a time attribute, making it possible to apply windowing to it.
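A sketch of that rewrite (untested, and the one-hour bound is purely an assumption about how far apart matching order and shipment events can arrive) might look like:
%flink.ssql(type=update)
SELECT DISTINCT oo.orderid,
       TUMBLE_START(oo.ts, INTERVAL '10' MINUTE) AS event_time,
       ss.shipments
FROM orders AS oo
JOIN shipment AS ss ON oo.orderid = ss.orderid
    AND ss.ts BETWEEN oo.ts - INTERVAL '1' HOUR AND oo.ts + INTERVAL '1' HOUR
GROUP BY oo.orderid, TUMBLE(oo.ts, INTERVAL '10' MINUTE), ss.shipments
Because the join condition now bounds the two time attributes relative to each other, the planner can execute it as an interval join, the output keeps oo.ts as a time attribute, and the TUMBLE window can be applied on top.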
In the second query, I'm not 100% sure what the problem is, but I suspect it's that TUMBLE_START is called on oo.ts while the GROUP BY uses CAST(oo.ts AS TIME); I think the two expressions need to be the same. I don't think Flink's SQL planner is smart enough to figure out what's going on here.

Flink - SQL Tumble End on event time not returning any result

I have a Flink job that consumes from a Kafka topic and tries to create windows based on a few columns like eventId and eventName. The Kafka topic has eventTimestamp as the timestamp field, populated in millis.
DataStreamSource<String> kafkaStream = env.fromSource(
    kafkaSource, // kafkaSource is built with the KafkaSource builder
    WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "KafkaSource");

// Doing some transformations to map to a POJO class.

Table kafkaTable = tableEnv.fromDataStream(
    kafkaStream, // the mapped stream (fromDataStream takes a DataStream, not the KafkaSource)
    Schema.newBuilder()
        .columnByExpression("proc_time", "PROCTIME()")
        // eventTimestamp is in millis
        .columnByExpression("event_time", "TO_TIMESTAMP_LTZ(eventTimestamp, 3)")
        .watermark("event_time", "event_time - INTERVAL '20' SECOND")
        .build());
The Tumble_End window query returns rows when proc_time is used, but doesn't return anything when I use event_time.
-- This query doesn't return anything
SELECT TUMBLE_END(event_time, INTERVAL '1' MINUTE), COUNT(DISTINCT eventId)
FROM kafkaTable GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
-- This query gives some results
SELECT TUMBLE_END(proc_time, INTERVAL '1' MINUTE), COUNT(DISTINCT eventId)
FROM kafkaTable GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE)
I tried to set env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); but this is deprecated in the 1.14.4 stable version I'm using.
I tried adding a custom WatermarkStrategy as well, but nothing worked. I'm not able to explain this behaviour. Can someone help with this?
David - Here is the code I'm using.
fun main() {
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    val tableEnv = StreamTableEnvironment.create(env)
    val kafkaSource = KafkaSource.builder<String>()
        .setBootstrapServers("localhost:9092")
        .setTopics("an-topic")
        .setGroupId("testGroup")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(SimpleStringSchema())
        .build()
    val kafkaStream = env.fromSource(kafkaSource,
        WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)), "KafkaSource")
    val kafkaRowMapper = kafkaStream.map(RowMapper())
    val finalTable = tableEnv.fromDataStream(kafkaRowMapper,
        Schema.newBuilder()
            .columnByExpression("proc_time", "PROCTIME()")
            .columnByExpression("event_time", "TO_TIMESTAMP_LTZ(f2, 3)")
            .watermark("event_time", "event_time - INTERVAL '20' SECOND")
            .build()
    ).renameColumns(
        `$`("f0").`as`("eventId"),
        `$`("f1").`as`("eventName"),
        `$`("f3").`as`("eventValue")
    )
    tableEnv.createTemporaryView("finalTable", finalTable)
    val sqlQuery = "SELECT eventId, eventName, TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS event_time_new, " +
        "LAST_VALUE(eventValue) AS eventValue FROM finalTable " +
        "GROUP BY eventId, eventName, TUMBLE(event_time, INTERVAL '1' MINUTE)"
    val resultTable = tableEnv.sqlQuery(sqlQuery)
    tableEnv.toDataStream(resultTable).print()
    env.execute("TestJob")
}

class RowMapper : MapFunction<String, Tuple4<String, String, Long, Float>> {
    override fun map(value: String): Tuple4<String, String, Long, Float> {
        val lineArray = value.split(",")
        return Tuple4(lineArray[0], lineArray[1], lineArray[2].toLong(), lineArray[3].toFloat())
    }
}
The Kafka topic has values like this:
event1,Util1,1647614467000,0.12
event1,Util1,1647614527000,0.26
event1,Util1,1647614587000,0.71
event2,Util2,1647614647000,0.08
event2,Util2,1647614707000,0.32
event2,Util2,1647614767000,0.23
event2,Util2,1647614827000,0.85
event1,Util1,1647614887000,0.08
event1,Util1,1647614947000,0.32
I added the line below after creating the table environment, and then I was able to create windows using event_time:
tableEnv.config.configuration.setString("table.exec.source.idle-timeout", "5000 ms")
With this option set, a source that receives no data for the given interval is marked idle and no longer holds back the watermark; otherwise an idle Kafka partition can keep the event-time windows from ever closing.
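For reference, if you set this option with SQL instead of through the TableConfig (e.g., in the Flink SQL client), the equivalent should be something like the following, assuming the 1.14-era SET syntax:
SET 'table.exec.source.idle-timeout' = '5000 ms';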

Multiple AggregateResult Queries

Hi guys,
I'm currently trying to join two objects in the same query or result.
My question is whether it's possible to show or debug the sum of FIELD A from Lead plus the sum of FIELD B from two different objects.
Here's an example I'm working on.
Btw, I really appreciate your time and comments, and if I'm making a mistake please let me know. Thank you.
public static void example() {
    String sQueryOne;
    String sQueryTwo;
    AggregateResult[] objOne;
    AggregateResult[] objTwo;
    // I tried to save the following queries into an SObject list
    List<SObject> bothObjects = new List<SObject>();
    sQueryOne = 'Select Count(Id) records, Sum(FieldA) fieldNA From Lead';
    objOne = Database.query(sQueryOne);
    sQueryTwo = 'Select Count(Id) records, Sum(FieldA) fieldNB From Opportunity';
    objTwo = Database.query(sQueryTwo);
    bothObjects.addAll(objOne);
    bothObjects.addAll(objTwo);
    for (SObject totalRec : bothObjects) {
        // There's a wrapper (className) I created which contains some variables (totalSum)
        className finalRes = new className();
        // (this line is the problem: fieldNA and fieldNB are query aliases, not Apex variables)
        finalRes.totalSum = (Integer.valueOf(fieldNA)) + (Integer.valueOf(fieldNB));
        System.debug('The sum is: ' + finalRes.totalSum);
    }
}
For example, if I call System.debug with the variable finalRes.totalSum, it just shows the first value (fieldNA) duplicated. The following debug shows the current values of the SObject list; I want to sum, for example, FIELD0 from Leads and FIELD0 from Opportunities.
You access the columns in an AggregateResult by calling get('columnAlias'). If you didn't specify an alias, they'll be auto-numbered by Salesforce as expr0, expr1, and so on. When in doubt you can always do System.debug(results);
Some more info: https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/langCon_apex_SOQL_agg_fns.htm
This might give you some ideas:
List<AggregateResult> results = new List<AggregateResult>{
    [SELECT COUNT(Id) records, SUM(NumberOfEmployees) fieldA, SUM(AnnualRevenue) fieldB FROM Account],
    [SELECT COUNT(Id) records, SUM(Amount) fieldA, SUM(TotalOpportunityQuantity) fieldB FROM Opportunity],
    [SELECT COUNT(Id) records, SUM(NumberOfEmployees) fieldA, SUM(AnnualRevenue) fieldB FROM Lead]
    /* hey, not my fault these are the only 2 standard numeric fields on Lead.
       It doesn't matter that they're identical to the Account fields; what matters is what their SUM(...) aliases are */
};
List<Decimal> totals = new List<Decimal>{0, 0, 0};
for (AggregateResult ar : results) {
    totals[0] += (Decimal) ar.get('records');
    totals[1] += (Decimal) ar.get('fieldA');
    totals[2] += (Decimal) ar.get('fieldB');
}
System.debug(totals); // (636, 8875206.0, 9819762558.0) in my dev org
(I'm not saying it's perfect; your wrapper class sounds like a better idea, or maybe even a Map<String, Decimal>. It depends on what you're going to do with the results.)

Get random result from JPQL query over large table

I'm currently using JPQL queries to retrieve information from a database. The purpose of the project is to test sample environments through randomness with different elements, so I need the queries to retrieve random single results throughout the project.
The problem I'm facing is that JPQL does not implement a proper function for random retrieval, and post-calculating the random pick takes too long (14 seconds for the attached function to return a random result).
public Player getRandomActivePlayerWithTransactions() {
    List<Player> randomPlayers = entityManager.createQuery("SELECT pw.playerId FROM PlayerWallet pw JOIN pw.playerId p"
            + " JOIN p.gameAccountCollection ga JOIN ga.iDAccountStatus acs"
            + " WHERE (SELECT count(col.playerWalletTransactionId) FROM pw.playerWalletTransactionCollection col) > 0 AND acs.code = :status")
        .setParameter("status", "ACTIVATED")
        .getResultList();
    return randomPlayers.get(random.nextInt(randomPlayers.size()));
}
As ORDER BY NEWID() is not allowed because of JPQL restrictions, I have tested the following inline conditions; all of them returned a syntax error on compilation.
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10
WHERE Rnd % 100 < 10
FROM TABLESAMPLE(10 PERCENT)
Have you considered generating a random number and skipping to that result?
I mean something like this:
String q = "SELECT COUNT(*) FROM Player p";
Query query = entityManager.createQuery(q);
Number countResult = (Number) query.getSingleResult();
int random = (int) (Math.random() * countResult.intValue()); // cast needed: Math.random() returns a double
Player randomPlayer = (Player) entityManager.createQuery("SELECT pw.playerId FROM PlayerWallet pw JOIN pw.playerId p"
        + " JOIN p.gameAccountCollection ga JOIN ga.iDAccountStatus acs"
        + " WHERE (SELECT count(col.playerWalletTransactionId) FROM pw.playerWalletTransactionCollection col) > 0 AND acs.code = :status")
    .setParameter("status", "ACTIVATED")
    .setFirstResult(random)
    .setMaxResults(1)
    .getSingleResult(); // a single result, not a List
I have figured it out. When retrieving the player I was also retrieving another unused related entity, plus all the entities related to that one, and so on.
After adding fetch = FetchType.LAZY (don't fetch the entity until required) to the problematic relation, the performance of the query increased dramatically.

Hibernate createSQLQuery gives me "I/O Error: Software caused connection abort" on SQL Server 2008 RS1

I'm using SQL Server's FREETEXTTABLE to search for some products in a view, passing three parameters: the strings to search, the first row to return, and the last row to return (the usual pagination).
The code goes like this:
public List searchProducto(final String claves, final Integer pagina, final Integer nroFilas) {
    return getHibernateTemplate().executeFind(new HibernateCallback() {
        public Object doInHibernate(Session session) throws HibernateException, SQLException {
            String sql = "SELECT * FROM (SELECT ROW_NUMBER() OVER(ORDER BY RANK DESC) AS RowNr, * FROM vista_web_productos AS vwp INNER JOIN FREETEXTTABLE(producto, clave, :name) AS ftt ON ftt.[KEY] = vwp.id_web) t WHERE RowNr BETWEEN :inicio AND :fin";
            SQLQuery query = (SQLQuery) session.createSQLQuery(sql);
            query.addEntity(VistaWebProductoVO.class)
                .setString("name", claves)
                .setInteger("inicio", ((pagina - 1) * nroFilas) + 1)
                .setInteger("fin", pagina * nroFilas);
            List list = query.list();
            return list;
        }
    });
}
When I test the code, sometimes it works (usually the first time it works OK) and sometimes it doesn't, especially if I debug my test and don't press the continue button (F5) immediately. Any ideas about the cause of the problem? I'm also using Spring 3.
