My use case
Collect events for a particular duration and then group them based on the key
Objective
After processing, the user can save the data for a particular duration, grouped by key.
How I am planning to do it
1) Receive events from Kafka.
2) Create a data stream of events.
3) Associate a table with it and collect data for a particular duration by running an SQL query.
4) Associate a new table with the output of the previous step and group the collected data by key.
5) Save the data in the DB.
Solution I tried
I am able to:
1) Receive events from Kafka.
2) Set up a data stream (let's say sensorDataStream):
DataStream<SensorEvent> sensorDataStream
    = source.flatMap(new FlatMapFunction<String, SensorEvent>() {
        @Override
        public void flatMap(String catalog, Collector<SensorEvent> out) {
            // create a SensorEvent(id, sensor notification value, notification time)
            // and emit it via out.collect(...)
        }
    });
3) Associate a table (let's say table1) with the data stream and run an SQL query like:
SELECT id, sensorNotif, notifTime FROM SENSORTABLE WHERE notifTime > t1_Timestamp AND notifTime < t2_Timestamp
Here t1_Timestamp and t2_Timestamp are predefined epoch times and will change based on some predefined conditions.
4) I am able to print this SQL query result on the console by using:
tableEnv.toAppendStream(table1, Row.class).print();
5) Created a new table (let's say table2) from table1 using the following type of SQL query:
Table table2 = tableEnv.sqlQuery("SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount FROM table1 GROUP BY id");
6) Collecting and printing the data (a consolidated sketch of steps 2 to 6 follows this list) by using:
tableEnv.toRetractStream(table2, Row.class).print();
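For reference, here is how these steps could fit together in one job. This is only a minimal sketch of the flow described above: env and sensorDataStream are assumed to come from steps 1 and 2, the column names are assumed to match the SensorEvent fields, and t1Timestamp / t2Timestamp stand in for the predefined epoch values.
// Assumes: env (StreamExecutionEnvironment) and sensorDataStream from steps 1-2,
// plus long values t1Timestamp and t2Timestamp for the interval bounds.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Step 3: register the stream as a table (column names are derived from the
// SensorEvent fields) and keep only the chosen interval.
tableEnv.createTemporaryView("SENSORTABLE", sensorDataStream);
Table table1 = tableEnv.sqlQuery(
    "SELECT id, sensorNotif, notifTime FROM SENSORTABLE"
    + " WHERE notifTime > " + t1Timestamp + " AND notifTime < " + t2Timestamp);
tableEnv.createTemporaryView("table1", table1);

// Step 5: group the filtered rows by key.
Table table2 = tableEnv.sqlQuery(
    "SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount"
    + " FROM table1 GROUP BY id");

// Step 6: a non-windowed GROUP BY over a stream is continuously updated,
// so table2 has to be converted with toRetractStream.
tableEnv.toRetractStream(table2, Row.class).print();

env.execute();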
Problem
1) I am not able to see the output of step 6 on the console.
I did some experimenting and found that if I skip the table1 setup step (that means no clubbing of sensor data for a duration) and directly associate my sensorDataStream with table2, then I can see the output of step 6. Since this is a retract stream, whenever a new event arrives the stream retracts the previously printed result and prints the newly calculated data.
Suggestions I would like to have
1) How can I merge step 5 and step 6 (that is, table1 and table2)? I already merged these tables, but since the data is not visible on the console I have doubts. Am I doing something wrong, or is the data merged but just not visible?
2) My plan is to:
2.a) Filter the data in two passes: in the first pass, filter the data for a particular interval, and in the second pass, group this data.
2.b) Save the output of 2.a in the DB.
Will this approach work (I have doubts because I am using a data stream, and the table1 output is an append stream while the table2 output is a retract stream)?
Related
I have two data sources: an S3 bucket and a Postgres database table. Both sources have records in the same format with a unique identifier of type uuid. Some of the records present in the S3 bucket are not part of the Postgres table, and the intent is to find those missing records. The data is bounded, as it is partitioned by day in the S3 bucket.
Reading the S3 source (I believe this operation reads the data in batch mode since I am not providing the monitorContinuously() argument):
final FileSource<GenericRecord> source = FileSource.forRecordStreamFormat(
AvroParquetReaders.forGenericRecord(schema), path).build();
final DataStream<GenericRecord> avroStream = env.fromSource(
source, WatermarkStrategy.noWatermarks(), "s3-source");
DataStream<Row> s3Stream = avroStream.map(x -> Row.of(x.get("uuid").toString()))
.returns(Types.ROW_NAMED(new String[] {"uuid"}, Types.STRING));
Table s3table = tableEnv.fromDataStream(s3Stream);
tableEnv.createTemporaryView("s3table", s3table);
For reading from Postgres, I created a Postgres catalog:
PostgresCatalog postgresCatalog = (PostgresCatalog) JdbcCatalogUtils.createCatalog(
catalogName,
defaultDatabase,
username,
pwd,
baseUrl);
tableEnv.registerCatalog(postgresCatalog.getName(), postgresCatalog);
tableEnv.useCatalog(postgresCatalog.getName());
Table dbtable = tableEnv.sqlQuery("select cast(uuid as varchar) from `localschema.table`");
tableEnv.createTemporaryView("dbtable", dbtable);
My intention was to simply perform a left join and find the missing records from dbtable. Something like this:
Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
resultStream.print();
However, it seems that the UUID column type is not supported just yet because I get the following exception.
Caused by: java.lang.UnsupportedOperationException: Doesn't support Postgres type 'uuid' yet
at org.apache.flink.connector.jdbc.dialect.psql.PostgresTypeMapper.mapping(PostgresTypeMapper.java:171)
As an alternative, I tried to read the database table as follows -
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] {
BasicTypeInfo.of(String.class)
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
JdbcInputFormat jdbcInputFormat = JdbcInputFormat.buildJdbcInputFormat()
.setDrivername("org.postgresql.Driver")
.setDBUrl("jdbc:postgresql://127.0.0.1:5432/localdatabase")
.setQuery("select cast(uuid as varchar) from localschema.table")
.setUsername("postgres")
.setPassword("postgres")
.setRowTypeInfo(rowTypeInfo)
.finish();
DataStream<Row> dbStream = env.createInput(jdbcInputFormat);
Table dbtable = tableEnv.fromDataStream(dbStream).as("uuid");
tableEnv.createTemporaryView("dbtable", dbtable);
Only this time, I get the following exception on performing the left join (as above) -
Exception in thread "main" org.apache.flink.table.api.TableException: Table sink '*anonymous_datastream_sink$3*' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin]
It works if I tweak the resultStream to publish the changelog stream:
Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
DataStream<Row> resultStream = tableEnv.toChangelogStream(resultTable);
resultStream.print();
Sample O/P
+I[9cc38226-bcce-47ce-befc-3576195a0933, null]
+I[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
+I[da6f57c8-3ad1-4df5-9636-c6b36df2695f, null]
+I[2f3845c1-6444-44b6-b1e8-c694eee63403, null]
-D[9cc38226-bcce-47ce-befc-3576195a0933, null]
-D[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
However, I do not want the sink to receive the inserts and deletes as separate records; I want just the final list of missing uuids. I guess this happens because my Postgres source, created with DataStream<Row> dbStream = env.createInput(jdbcInputFormat);, is a streaming source. If I try to execute the whole application in BATCH mode, I get the following exception:
org.apache.flink.table.api.ValidationException: Querying an unbounded table '*anonymous_datastream_source$2*' in batch mode is not allowed. The table source is unbounded.
Is it possible to have a bounded JDBC source? If not, how can I achieve this using the streaming API? (I am using Flink version 1.15.2.)
I believe this kind of case would be a common use case that can be implemented with Flink, but clearly I'm missing something. Any leads would be appreciated.
For now, a common approach would be to sink the resultStream to a table. You can then schedule a job that truncates the table and executes the Apache Flink job, and afterwards read the results from this table.
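As a rough sketch of that approach using Flink SQL's JDBC connector (the sink table name missing_uuids, its schema, and the connection settings are illustrative assumptions, not taken from your setup), you could declare a sink table with a primary key so the JDBC connector runs in upsert mode and can consume the update/delete changes produced by the left join:
// Sketch only: declare a JDBC sink table in Postgres for the missing uuids.
// A primary key puts the JDBC sink in upsert mode, so it can absorb the
// update/delete changes of the left join.
tableEnv.executeSql(
    "CREATE TEMPORARY TABLE missing_uuids ("
    + "  `uuid` STRING,"
    + "  PRIMARY KEY (`uuid`) NOT ENFORCED"
    + ") WITH ("
    + "  'connector' = 'jdbc',"
    + "  'url' = 'jdbc:postgresql://127.0.0.1:5432/localdatabase',"
    + "  'table-name' = 'localschema.missing_uuids',"
    + "  'username' = 'postgres',"
    + "  'password' = 'postgres'"
    + ")");

// Write the join result straight into the sink; the scheduled job can then
// truncate localschema.missing_uuids before each run and read it afterwards.
tableEnv.sqlQuery(
        "SELECT s3table.uuid FROM s3table"
        + " LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid"
        + " WHERE dbtable.uuid IS NULL")
    .executeInsert("missing_uuids");
The exact column type and connector options depend on your Flink and flink-connector-jdbc versions, so treat this only as a starting point.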
I also noticed that Apache Flink Table Store 0.3.0 was just released, and they have materialized views on the roadmap for 0.4.0. This might be a solution too. Very exciting, IMHO.
I'm trying to join two continuous queries, but keep running into the following error:
Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before. Please check the documentation for the set of currently supported SQL features.
Here's the table definition:
CREATE TABLE `Combined` (
`machineID` STRING,
`cycleID` BIGINT,
`start` TIMESTAMP(3),
`end` TIMESTAMP(3),
WATERMARK FOR `end` AS `end` - INTERVAL '5' SECOND,
`sensor1` FLOAT,
`sensor2` FLOAT
)
and the insert query
INSERT INTO `Combined`
SELECT
a.`MachineID`,
a.`cycleID`,
MAX(a.`start`) `start`,
MAX(a.`end`) `end`,
MAX(a.`sensor1`) `sensor1`,
MAX(m.`sensor2`) `sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
a.`MachineID` = m.`MachineID` AND
a.`cycleID` = m.`cycleID` AND
a.`start` = m.`timestamp`
GROUP BY a.`MachineID`, a.`cycleID`, SESSION(a.`start`, INTERVAL '1' SECOND)
In the source tables Aggregated and MachineStatus, the start and timestamp columns are time attributes with a watermark.
I've tried casting the input rows of the join to timestamps, but that didn't fix the issue and would mean that I cannot use SESSION, which is supposed to ensure that only one data point gets recorded per cycle.
Any help is greatly appreciated!
I investigated this a little further and noticed that the GROUP BY clause doesn't make sense in that context.
Furthermore, the SESSION window can be replaced by a time-windowed (interval) join, which is the more idiomatic approach:
INSERT INTO `Combined`
SELECT
a.`MachineID`,
a.`cycleID`,
a.`start`,
a.`end`,
a.`sensor1`,
m.`sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
a.`MachineID` = m.`MachineID` AND
a.`cycleID` = m.`cycleID` AND
m.`timestamp` BETWEEN a.`start` AND a.`start` + INTERVAL '0' SECOND
To understand the different ways to join dynamic tables, I found the Ververica SQL training extremely helpful.
I'm trying to create automation to copy the ZipCode__c text field on the Connection sObject to the Zip_Code__c text field on the Prem sObject. I can't use formula references since I need to be able to search the copied field. One Connection can have many Prems.
trigger updatePremFromConnection on Prem__c (before insert,after insert, after update,before update) {
List<Connection__c> connection = new List<Connection__c>();
for (Prem__c p: [SELECT Connection_id__c,id, Name
FROM Prem__c
WHERE Connection_id__c
NOT IN (SELECT id FROM Connection__c)
AND id IN : Trigger.new ]){
connection.add(new Connection__c(
ZipCode__c = p.Zip_Code__c));
}
if (connection.size() > 0) {
insert connection;
}
}
I need the Zip code field on Prem__c to be auto-updated when I edit a Connection.
There are several issues with this code.
Trigger Object
Your trigger is on the wrong object and is doing exactly the opposite of your stated intent.
I need the Zip code field on Prem__c to be auto-updated when I edit a Connection.
Your trigger on the Prem__c object attempts to copy data to the Connection__c object, while your objective is to copy from Connection__c to Prem__c. You'll definitely need an after update trigger on Connection__c and a before insert trigger on Prem__c. However, if the relationship between the two objects is a Lookup, or a Master-Detail relationship configured to be reparentable, you'll also need an update trigger on the child object Prem__c to handle situations where the child record is reparented, by updating from the new parent Connection.
Logic
This logic:
for (Prem__c p: [SELECT Connection_id__c,id, Name
FROM Prem__c
WHERE Connection_id__c
NOT IN (SELECT id FROM Connection__c)
AND id IN : Trigger.new ]){
connection.add(new Connection__c(
ZipCode__c = p.Zip_Code__c));
}
really doesn't make sense. It only finds Prem__c records in the trigger set that don't have an associated Connection, makes a new Connection, and then doesn't establish a relationship between the two records. The way it does this is also needlessly inefficient; that NOT IN subquery doesn't need to be there, because it could simply be Connection_Id__c = null.
Instead, you probably want your Connection__c trigger to have a query like this:
SELECT ZipCode__c, (SELECT Zip_Code__c FROM Prems__r)
FROM Connection__c
WHERE Id IN :Trigger.new
Then, you can iterate over those Connection__c records with an inner for loop over their associated Prem__c records. Note that above you'll need to use the actual relationship name where I have Prems__r. The logic would look something like this:
List<Prem__c> premsToUpdate = new List<Prem__c>();
for (Connection__c conn : queriedConnections) {
    for (Prem__c prem : conn.Prems__r) {
        if (prem.Zip_Code__c != conn.ZipCode__c) {
            prem.Zip_Code__c = conn.ZipCode__c;
            premsToUpdate.add(prem);
        }
    }
}
update premsToUpdate;
Before running the query, you should also gather a Set<Id> of only those records for which the ZipCode__c field has actually changed, i.e., where thisConn.ZipCode__c != Trigger.oldMap.get(thisConn.Id).ZipCode__c. You'd use that Set<Id> in place of Trigger.new in your query, so that you only obtain those records with relevant changes.
I have a View Object A that I am using as an accessor in a List of Values on View Object VO. A also has a View Criteria to filter the search. Auto Refresh is set to True.
Now, whenever I open the List of Values on VO and search it, I get this error:
Unsupported query for Continuous Query Notification
Here is my query for A (which, again, is being used as the List of Values accessor):
select e.employee_id,
e.first_name emp_name ,
d.department_name,
j.job_title,
g.grade_name
from
hr_employees e,
hr_departments d,
hr_jobs j,
prl_grade_header g
where e.department_id = d.department_id
and e.job_id = j.job_id(+)
and e.grade_id = g.grade_header_id(+)
My query does not have anything that should cause this error, as per the Oracle documentation.
What is going wrong, and how can I get rid of this error?
As per Oracle, one of the objects does not reference a table name directly.
Maybe you are using synonyms instead of tables in your query.
There is a restriction at the database level that prevents using synonyms for Object Change Notification (OCN) registration.
I am trying to use Dapper to support my data access for my server app.
My server app has another application that drops records into my database at a rate of 400 per minute.
My app pulls them out in batches, processes them, and then deletes them from the database.
Since data continues to flow into the database while I am processing, I don't have a good way to say delete from myTable where allProcessed = true.
However, I do know the PK value of the rows to delete. So I want to do a delete from myTable where Id in @listToDelete
Problem is that if my server goes down for even 6 minutes, then I have over 2100 rows to delete.
Since Dapper takes my @listToDelete and turns each one into a parameter, my call to delete fails. (Causing my data purging to get even further behind.)
What is the best way to deal with this in Dapper?
NOTES:
I have looked at table-valued parameters, but from what I can see they are not very performant. This piece of my architecture is the bottleneck of my system, and I need it to be very, very fast.
One option is to create a temp table on the server and then use the bulk load facility to upload all the IDs into that table at once. Then use a join, EXISTS or IN clause to delete only the records that you uploaded into your temp table.
Bulk loads are a well-optimized path in SQL Server and it should be very fast.
For example:
Execute the statement CREATE TABLE #RowsToDelete(ID INT PRIMARY KEY)
Use a bulk load to insert keys into #RowsToDelete
Execute DELETE FROM myTable where Id IN (SELECT ID FROM #RowsToDelete)
Execute DROP TABLE #RowsToDelete (the table will also be dropped automatically if you close the session)
(Assuming Dapper) code example:
conn.Open();
var columnName = "ID";
conn.Execute(string.Format("CREATE TABLE #{0}s({0} INT PRIMARY KEY)", columnName));
using (var bulkCopy = new SqlBulkCopy(conn))
{
    bulkCopy.BatchSize = ids.Count;
    bulkCopy.DestinationTableName = string.Format("#{0}s", columnName);
    var table = new DataTable();
    table.Columns.Add(columnName, typeof (int));
    bulkCopy.ColumnMappings.Add(columnName, columnName);
    foreach (var id in ids)
    {
        table.Rows.Add(id);
    }
    bulkCopy.WriteToServer(table);
}
// or do other things with your table instead of deleting here
conn.Execute(string.Format(@"DELETE FROM myTable WHERE Id IN
    (SELECT {0} FROM #{0}s)", columnName));
conn.Execute(string.Format("DROP TABLE #{0}s", columnName));
To get this code working, I went to the dark side.
Since Dapper turns my list into parameters, and SQL Server can't handle a lot of parameters (I have never even needed double-digit parameter counts before), I had to go with dynamic SQL.
So here was my solution:
string listOfIdsJoined = "("+String.Join(",", listOfIds.ToArray())+")";
connection.Execute("delete from myTable where Id in " + listOfIdsJoined);
Before everyone grabs their torches and pitchforks, let me explain.
This code runs on a server whose only input is a data feed from a Mainframe system.
The list I am dynamically creating is a list of longs/bigints.
The longs/bigints are from an Identity column.
I know constructing dynamic SQL is bad juju, but in this case, I just can't see how it leads to a security risk.
Dapper expects a list of objects that have the parameter as a property, so in the above case a list of objects with Id as a property will work.
connection.Execute("delete from myTable where Id in (@Id)", listOfIds.AsEnumerable().Select(i => new { Id = i }).ToList());
This will work.