Finding missing records from 2 data sources with Flink - apache-flink

I have two data sources: an S3 bucket and a Postgres database table. Both sources have records in the same format, with a unique identifier of type uuid. Some of the records present in the S3 bucket are not part of the Postgres table, and the intent is to find those missing records. The data is bounded, as it is partitioned by day in the S3 bucket.
Reading the S3 source (I believe this reads the data in batch mode, since I am not calling monitorContinuously()) -
final FileSource<GenericRecord> source = FileSource.forRecordStreamFormat(
        AvroParquetReaders.forGenericRecord(schema), path).build();
final DataStream<GenericRecord> avroStream = env.fromSource(
        source, WatermarkStrategy.noWatermarks(), "s3-source");
DataStream<Row> s3Stream = avroStream
        .map(x -> Row.of(x.get("uuid").toString()))
        .returns(Types.ROW_NAMED(new String[] {"uuid"}, Types.STRING));
Table s3table = tableEnv.fromDataStream(s3Stream);
tableEnv.createTemporaryView("s3table", s3table);
For reading from Postgres, I created a Postgres catalog -
PostgresCatalog postgresCatalog = (PostgresCatalog) JdbcCatalogUtils.createCatalog(
        catalogName,
        defaultDatabase,
        username,
        pwd,
        baseUrl);
tableEnv.registerCatalog(postgresCatalog.getName(), postgresCatalog);
tableEnv.useCatalog(postgresCatalog.getName());
Table dbtable = tableEnv.sqlQuery("select cast(uuid as varchar) from `localschema.table`");
tableEnv.createTemporaryView("dbtable", dbtable);
My intention was to simply perform a left join and find the records that are missing from dbtable. Something like this -
Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
resultStream.print();
However, it seems that the UUID column type is not supported just yet because I get the following exception.
Caused by: java.lang.UnsupportedOperationException: Doesn't support Postgres type 'uuid' yet
at org.apache.flink.connector.jdbc.dialect.psql.PostgresTypeMapper.mapping(PostgresTypeMapper.java:171)
As an alternative, I tried to read the database table as follows -
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] {
        BasicTypeInfo.of(String.class)
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
JdbcInputFormat jdbcInputFormat = JdbcInputFormat.buildJdbcInputFormat()
        .setDrivername("org.postgresql.Driver")
        .setDBUrl("jdbc:postgresql://127.0.0.1:5432/localdatabase")
        .setQuery("select cast(uuid as varchar) from localschema.table")
        .setUsername("postgres")
        .setPassword("postgres")
        .setRowTypeInfo(rowTypeInfo)
        .finish();
DataStream<Row> dbStream = env.createInput(jdbcInputFormat);
Table dbtable = tableEnv.fromDataStream(dbStream).as("uuid");
tableEnv.createTemporaryView("dbtable", dbtable);
Only this time, I get the following exception on performing the left join (as above) -
Exception in thread "main" org.apache.flink.table.api.TableException: Table sink '*anonymous_datastream_sink$3*' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin]
It works if I tweak the resultStream to publish a changelog stream -
Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
DataStream<Row> resultStream = tableEnv.toChangelogStream(resultTable);
resultStream.print();
Sample O/P
+I[9cc38226-bcce-47ce-befc-3576195a0933, null]
+I[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
+I[da6f57c8-3ad1-4df5-9636-c6b36df2695f, null]
+I[2f3845c1-6444-44b6-b1e8-c694eee63403, null]
-D[9cc38226-bcce-47ce-befc-3576195a0933, null]
-D[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
However, I do not want the sink to receive separate insert and delete records; I want just the final list of missing uuids. I guess this happens because my Postgres source, created with DataStream<Row> dbStream = env.createInput(jdbcInputFormat);, is a streaming source. If I try to execute the whole application in BATCH mode, I get the following exception -
org.apache.flink.table.api.ValidationException: Querying an unbounded table '*anonymous_datastream_source$2*' in batch mode is not allowed. The table source is unbounded.
Is it possible to have a bounded JDBC source? If not, how can I achieve this using the streaming API? (I am using Flink 1.15.2.)
I believe this would be a common use case that can be implemented with Flink, but clearly I'm missing something. Any leads would be appreciated.

For now, a common approach would be to sink the resultStream to a table: schedule a job that truncates the table, then executes the Apache Flink job, and then reads the results from that table.
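A rough sketch of that approach, assuming a hypothetical Postgres table missing_uuids that the scheduled job truncates before each run, and the s3table/dbtable views registered above. Declaring a primary key on the sink lets the JDBC connector write in upsert mode, so it can consume the update/delete changes produced by the left join -
// Hypothetical JDBC sink; the primary key enables upsert mode so the
// changelog from the left join can be absorbed by the sink.
tableEnv.executeSql(
        "CREATE TEMPORARY TABLE missing_uuids ("
        + "  uuid STRING,"
        + "  PRIMARY KEY (uuid) NOT ENFORCED"
        + ") WITH ("
        + "  'connector' = 'jdbc',"
        + "  'url' = 'jdbc:postgresql://127.0.0.1:5432/localdatabase',"
        + "  'table-name' = 'localschema.missing_uuids',"
        + "  'username' = 'postgres',"
        + "  'password' = 'postgres'"
        + ")");

tableEnv.executeSql(
        "INSERT INTO missing_uuids "
        + "SELECT s3table.uuid FROM s3table "
        + "LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid "
        + "WHERE dbtable.uuid IS NULL");
The scheduled job can then truncate localschema.missing_uuids, run the Flink job, and read the remaining rows.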
I also noticed that Apache Flink Table Store 0.3.0 was just released, and they have materialized views on the roadmap for 0.4.0. That might be a solution too. Very exciting, imho.

Related

Flink SQL running out of memory doing Select - Insert from RDS to Mysql

In my pipeline I am using PyFlink to load and transform data from an RDS and sink it to MySQL. Using Flink CDC I am able to get the data I want from the RDS, and with the JDBC connector I sink it to MySQL. My aim is to read 1 table and create 10 others using a sample of the code below, in 1 job (basically breaking a huge table into smaller tables). The problem I am facing is that despite using RocksDB as the state backend and options in Flink CDC such as scan.incremental.snapshot.chunk.size, scan.snapshot.fetch.size and debezium.min.row.count.to.stream.result, the memory usage keeps growing, causing a TaskManager with 2GB memory to fail. My intuition here is that a simple select-insert query loads the whole table into memory no matter what! If so, can I somehow avoid that? The table size is around 500k rows.
env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
stmt_set = t_env.create_statement_set()
create_kafka_source = (
    """
    CREATE TABLE somethin(
        bla INT,
        bla1 DOUBLE,
        bla2 TIMESTAMP(3),
        PRIMARY KEY(bla2) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'server-id' = '1000',
        'debezium.snapshot.mode' = 'when_needed',
        'debezium.poll.interval.ms' = '5000',
        'hostname' = 'som2',
        'port' = 'som2',
        'database-name' = 'som3',
        'username' = 'som4',
        'password' = 'somepass',
        'table-name' = 'atable'
    )
    """
)
create_kafka_dest = (
    """CREATE TABLE IF NOT EXISTS atable(
        time1 TIMESTAMP(3),
        blah2 DOUBLE,
        PRIMARY KEY(time_stamp) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://name1:3306/name1',
        'table-name' = 't1',
        'username' = 'user123',
        'password' = '123'
    )"""
)
t_env.execute_sql(create_kafka_source)
t_env.execute_sql(create_kafka_dest)
stmt_set.add_insert_sql(
    "INSERT INTO atable SELECT DISTINCT bla2, bla1 "
    "FROM somethin"
)
Using DISTINCT in a streaming query is expensive, especially when there aren't any temporal constraints on the distinctiveness (e.g., counting unique visitors per day). I imagine that's why your query needs a lot of state.
However, you should be able to get this to work. RocksDB isn't always well-behaved; sometimes it will consume more memory than it has been allocated.
What version of Flink are you using? Improvements were made in Flink 1.11 (by switching to jemalloc) and further improvements came in Flink 1.14 (by upgrading to a newer version of RocksDB). So upgrading Flink might fix this. Otherwise you may need to basically lie and tell Flink it has somewhat less memory than it actually has, so that when RocksDB steps out of bounds it doesn't cause out-of-memory errors.
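If you go down that route, the knobs involved are the TaskManager memory options, which normally live in flink-conf.yaml. A rough sketch with illustrative numbers only (shown in Java just to list the keys; on a real cluster you would set the same keys in flink-conf.yaml, and they would need tuning for a 2 GB TaskManager) -
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// Declare less memory than the container really has, leaving headroom for
// RocksDB allocations that escape Flink's accounting (illustrative values).
conf.setString("taskmanager.memory.process.size", "1600m");
// Shrink the managed-memory share that the RocksDB state backend may use.
conf.setString("taskmanager.memory.managed.fraction", "0.3");
// Extra budget for JVM/native overhead outside managed memory.
conf.setString("taskmanager.memory.jvm-overhead.max", "512m");

// For a local test the configuration can be passed to the environment directly;
// on a cluster these are process-level settings read at TaskManager startup.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);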

Event processing by using Flink SQL API

My Use case-
Collect events for a particular duration and then group them based on the key
Objective
After processing, the user can save the data of a particular duration, based on the key.
How I am planning to do it
1) Receive events from Kafka
2) Create a data stream of events
3) Associate a table with it and collect data for a particular duration by running a SQL query
4) Associate a new table with the step-2 output and group the collected data according to the key
5) Save the data in the DB
Solution I tried -
I am able to -
1) receive events from Kafka,
2) set up a data stream (let's say sensorDataStream) -
DataStream<SensorEvent> sensorDataStream =
        source.flatMap(new FlatMapFunction<String, SensorEvent>() {
            @Override
            public void flatMap(String catalog, Collector<SensorEvent> out) {
                // create a SensorEvent(id, sensor notification value, notification time)
                // and emit it via out.collect(...)
            }
        });
3) associate a table (let's say table1) with the data stream and run a SQL query like -
SELECT id, sensorNotif, notifTime FROM SENSORTABLE WHERE notifTime > t1_Timestamp AND notifTime < t2_Timestamp.
Here t1_Timestamp and t2_Timestamp are predefined epoch times and will change based on some predefined conditions.
4) I am able to print this SQL query result on the console by using -
tableEnv.toAppendStream(table1, Row.class).print();
5) created a new table (let's say table2) by using table1 and the following type of SQL query -
Table table2 = tableEnv.sqlQuery("SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount FROM table1 GROUP BY id");
6) collecting and printing data by using -
tableEnv.toRetractStream(table2 , Row.class).print();
Problem
1) I am not able to see the output of step 6 on the console.
I did some experimenting and found that if I skip the table1 setup step (that means no clubbing of sensor data for a duration) and directly associate my sensorDataStream with table2, then I can see the output of step 6. But since this is a retract stream, whenever a new event arrives the stream invalidates the previously emitted data and prints the newly calculated data.
Suggestions I would like to have
1) How can I merge step 5 and step 6 (that is, table1 and table2)? I already merged these tables, but since the data is not visible on the console I have doubts. Am I doing something wrong? Or is the data merged but just not visible? (A sketch of such a merged query follows after this list.)
2) My plan is to -
2.a) filter the data in 2 passes: in the first pass, filter the data for a particular interval, and in the second pass, group this data
2.b) save the 2.a output in the DB
Will this approach work? (I have doubts because I am using a data stream, and the table1 output is an append stream whereas the table2 output is a retract stream.)
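A minimal sketch of the merged query from suggestion 1, assuming the SENSORTABLE view, tableEnv, env, and the epoch bounds t1_Timestamp/t2_Timestamp (as long values) from the question; the time-range filter and the per-key aggregation are combined into a single query -
Table merged = tableEnv.sqlQuery(
        "SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount "
        + "FROM SENSORTABLE "
        + "WHERE notifTime > " + t1_Timestamp + " AND notifTime < " + t2_Timestamp + " "
        + "GROUP BY id");

// The aggregation still produces updates, so a retract stream is needed to observe it.
tableEnv.toRetractStream(merged, Row.class).print();
env.execute("merged-query-sketch");
Whether output appears on the console still depends on events actually falling inside the [t1_Timestamp, t2_Timestamp] window.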

SQL CLR Trigger - get source table

I am creating a DB synchronization engine using SQL CLR Triggers in Microsoft SQL Server 2012. These triggers do not call a stored procedure or function (and thereby have access to the INSERTED and DELETED pseudo-tables but do not have access to @@PROCID).
Differences here, for reference.
This "sync engine" uses mapping tables to determine what the table and field maps are for this sync job. In order to determine the target table and fields (from my mapping table) I need to get the source table name from the trigger itself. I have come across many answers on Stack Overflow and other sites that say that this isn't possible. But, I've found one website that provides a clue:
Potential Solution:
using (SqlConnection lConnection = new SqlConnection(@"context connection=true")) {
    SqlCommand cmd = new SqlCommand("SELECT object_name(resource_associated_entity_id) FROM sys.dm_tran_locks WHERE request_session_id = @@spid and resource_type = 'OBJECT'", lConnection);
    cmd.CommandType = CommandType.Text;
    var obj = cmd.ExecuteScalar();
}
This does in fact return the correct table name.
Question:
My question is, how reliable is this potential solution? Is the @@SPID actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id? Will it stand up to multiple executions of the same and/or different triggers within the database?
From these sites, it seems the process Id is in fact limited to the open connection, which doesn't overlap: here, here, and here.
Will this be a safe method to get my source table?
Why?
I've noticed similar questions, but all without a valid answer for my specific situation (except that one). Most of the comments on those sites ask "Why?", and in order to preempt that, here is why:
This synchronization engine operates on a single DB and can push changes to target tables, transforming the data with user-defined transformations, automatic source-to-target type casting and parsing and can even use the CSharpCodeProvider to execute methods also stored in those mapping tables for transforming data. It is already built, quite robust and has good performance metrics for what we are doing. I'm now trying to build it out to allow for 1:n table changes (including extension tables requiring the same Id as the 'master' table) and am trying to "genericise" the code. Previously each trigger had a "target table" definition hard coded in it and I was using my mapping tables to determine the source. Now I'd like to get the source table and use my mapping tables to determine all the target tables. This is used in a medium-load environment and pushes changes to a "Change Order Book" which a separate server process picks up to finish the CRUD operation.
Edit
As mentioned in the comments, the query listed above is quite "iffy". It will often (after a SQL Server restart, for example) return system objects like syscolpars or sysidxstats. But, it seems that in the dm_tran_locks table there's always an associated resource_type of 'RID' (Row ID) with the same object_name. My current query which works reliably so far is the following (will update if this changes or doesn't work under high load testing):
select t1.ObjectName FROM (
    SELECT object_name(resource_associated_entity_id) as ObjectName
    FROM sys.dm_tran_locks WHERE resource_type = 'OBJECT' and request_session_id = @@spid
) t1 inner join (
    SELECT OBJECT_NAME(partitions.OBJECT_ID) as ObjectName
    FROM sys.dm_tran_locks
    INNER JOIN sys.partitions ON partitions.hobt_id = dm_tran_locks.resource_associated_entity_id
    WHERE resource_type = 'RID'
) t2 on t1.ObjectName = t2.ObjectName
If this is always the case, I'll have to find that out during testing.
How reliable is this potential solution?
While I do not have time to set up a test case to show it not working, I find this approach (even taking into account the query in the Edit section) "iffy" (i.e. not guaranteed to always be reliable).
The main concerns are:
cascading (whether recursive or not) Trigger executions
User (i.e. Explicit / Implicit) transactions
Sub-processes (i.e. EXEC and sp_executesql)
These scenarios allow for multiple objects to be locked, all at the same time.
Is the @@SPID actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id?
and (from a comment on the question):
I think I can join my query up with the sys.partitions and get a dm_trans_lock that has a type of 'RID' with an object name that will match up to the one in my original query.
And here is why it shouldn't be entirely reliable: the Session ID (i.e. @@SPID) is constant for all of the requests on that Connection. So all sub-processes (i.e. EXEC calls, sp_executesql, Triggers, etc.) will be on the same @@SPID / session_id. So, between sub-processes and User Transactions, you can very easily get locks on multiple resources, all on the same Session ID.
The reason I say "resources" instead of "OBJECT" or even "RID" is that locks can occur on: rows, pages, keys, tables, schemas, stored procedures, the database itself, etc. More than one thing can be considered an "OBJECT", and it is possible that you will have page locks instead of row locks.
Will it stand up to multiple executions of the same and/or different triggers within the database?
As long as these executions occur in different Sessions, then they are a non-issue.
ALL THAT BEING SAID, I can see where simple testing would show that your current method is reliable. However, it should also be easy enough to add more detailed tests that include an explicit transaction that first does some DML on another table, or have a trigger on one table do some DML on one of these tables, etc.
Unfortunately, there is no built-in mechanism that provides the same functionality that @@PROCID does for T-SQL Triggers. I have come up with a scheme that should allow for getting the parent table for a SQLCLR Trigger (that takes into account these various issues), but haven't had a chance to test it out. It requires using a T-SQL trigger, set as the "first" trigger, to set info that can be discovered by the SQLCLR Trigger.
A simpler form can be constructed using CONTEXT_INFO, if you are not already using it for something else (and if you don't already have a "first" Trigger set). In this approach you would still create a T-SQL Trigger, and then set it as the "first" Trigger using sp_settriggerorder. In this Trigger you SET CONTEXT_INFO to the table name that is the parent of @@PROCID. You can then read CONTEXT_INFO() on a Context Connection in a SQLCLR Trigger. If there are multiple levels of Triggers then the value of CONTEXT_INFO will get overwritten, so reading that value must be the first thing you do in each SQLCLR Trigger.
This is an old thread, but it is an FAQ and I think I have a better solution. Essentially it uses the schema of the inserted or deleted table to find the base table by doing a hash of the column names and comparing the hash with the hashes of tables with a CLR trigger on them.
Code snippet below - at some point I will probably put the whole solution on Git (it sends a message to Azure Service Bus when the trigger fires).
private const string colqry = "select top 1 * from inserted union all select top 1 * from deleted";
private const string hashqry = "WITH cols as ( "+
"select top 100000 c.object_id, column_id, c.[name] "+
"from sys.columns c "+
"JOIN sys.objects ot on (c.object_id= ot.parent_object_id and ot.type= 'TA') " +
"order by c.object_id, column_id ) "+
"SELECT s.[name] + '.' + o.[name] as 'TableName', CONVERT(NCHAR(32), HASHBYTES('MD5',STRING_AGG(CONVERT(NCHAR(32), HASHBYTES('MD5', cols.[name]), 2), '|')),2) as 'MD5Hash' " +
"FROM cols "+
"JOIN sys.objects o on (cols.object_id= o.object_id) "+
"JOIN sys.schemas s on (o.schema_id= s.schema_id) "+
"WHERE o.is_ms_shipped = 0 "+
"GROUP BY s.[name], o.[name]";
public static void trgSendSBMsg()
{
    string table = "";
    SqlCommand cmd;
    SqlDataReader rdr;
    SqlTriggerContext trigContxt = SqlContext.TriggerContext;
    SqlPipe p = SqlContext.Pipe;

    using (SqlConnection con = new SqlConnection("context connection=true"))
    {
        try
        {
            con.Open();
            string tblhash = "";

            // Hash the column names of the inserted/deleted pseudo-table.
            using (cmd = new SqlCommand(colqry, con))
            {
                using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
                {
                    if (rdr.Read())
                    {
                        MD5 hash = MD5.Create();
                        StringBuilder hashstr = new StringBuilder(250);
                        for (int i = 0; i < rdr.FieldCount; i++)
                        {
                            if (i > 0) hashstr.Append("|");
                            hashstr.Append(GetMD5Hash(hash, rdr.GetName(i)));
                        }
                        tblhash = GetMD5Hash(hash, hashstr.ToString().ToUpper()).ToUpper();
                    }
                    rdr.Close();
                }
            }

            // Compare against the hashes of all tables that have a CLR trigger on them.
            using (cmd = new SqlCommand(hashqry, con))
            {
                using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
                {
                    while (rdr.Read())
                    {
                        string hash = rdr.GetString(1).ToUpper();
                        if (hash == tblhash)
                        {
                            table = rdr.GetString(0);
                            break;
                        }
                    }
                    rdr.Close();
                }
            }

            if (table.Length == 0)
            {
                p.Send("Error: Unable to find table that CLR trigger is on. Message not sent!");
                return;
            }
            ….
HTH

How table was created in SQL-Server

What I need to find is the procedure used to recreate some table: what data sources were used, which scripts (if any), etc. So is it possible to differentiate somehow, maybe in system views or similar, whether the table was created manually or by a query, and whether the data was imported from external data or from an already existing table/view in the database? I already know who created it and when. I've pretty much screened the whole database without results and now I am looking for hints in the metadata.
If the table was created recently, you can glean information from the default trace. The query below will list object created and altered events. The default trace is a rollover trace so forensic information will be limited based on activity.
SELECT
trace.DatabaseName
,trace.ObjectName
,te.name AS EventName
,tsv.subclass_name
,trace.EventClass
,trace.EventSubClass
,trace.StartTime
,trace.EndTime
,trace.NTDomainName
,trace.NTUserName
,trace.HostName
,trace.ApplicationName
,trace.Spid
FROM (SELECT REVERSE(STUFF(REVERSE(path), 1, CHARINDEX(N'\', REVERSE(path)), '')) + N'\Log.trc' AS path
FROM sys.traces WHERE is_default = 1) AS default_trace_path
CROSS APPLY fn_trace_gettable(default_trace_path.path, DEFAULT) AS trace
JOIN sys.trace_events AS te ON
trace.EventClass=te.trace_event_id
JOIN sys.trace_subclass_values AS tsv ON
tsv.trace_event_id = EventClass
AND tsv.subclass_value = trace.EventSubClass
WHERE te.name IN(N'Object:Altered', N'Object:Created')
AND tsv.subclass_name = 'Commit'
ORDER BY trace.StartTime;

Correct method of deleting over 2100 rows (by ID) with Dapper

I am trying to use Dapper to support my data access for my server app.
My server app has another application that drops records into my database at a rate of 400 per minute.
My app pulls them out in batches, processes them, and then deletes them from the database.
Since data continues to flow into the database while I am processing, I don't have a good way to say delete from myTable where allProcessed = true.
However, I do know the PK values of the rows to delete. So I want to do a delete from myTable where Id in @listToDelete.
Problem is that if my server goes down for even 6 minutes, then I have over 2100 rows to delete.
Since Dapper takes my @listToDelete and turns each item into a parameter, my call to delete fails. (Causing my data purging to get even further behind.)
What is the best way to deal with this in Dapper?
NOTES:
I have looked at table-valued parameters, but from what I can see they are not very performant. This piece of my architecture is the bottleneck of my system and it needs to be very, very fast.
One option is to create a temp table on the server and then use the bulk load facility to upload all the IDs into that table at once. Then use a join, EXISTS or IN clause to delete only the records that you uploaded into your temp table.
Bulk loads are a well-optimized path in SQL Server and it should be very fast.
For example:
Execute the statement CREATE TABLE #RowsToDelete(ID INT PRIMARY KEY)
Use a bulk load to insert keys into #RowsToDelete
Execute DELETE FROM myTable where Id IN (SELECT ID FROM #RowsToDelete)
Execute DROP TABLE #RowsToDelete (the table will also be automatically dropped if you close the session)
(Assuming Dapper) code example:
conn.Open();

var columnName = "ID";
// Create a temp table keyed on the IDs we want to delete.
conn.Execute(string.Format("CREATE TABLE #{0}s({0} INT PRIMARY KEY)", columnName));

using (var bulkCopy = new SqlBulkCopy(conn))
{
    bulkCopy.BatchSize = ids.Count;
    bulkCopy.DestinationTableName = string.Format("#{0}s", columnName);

    var table = new DataTable();
    table.Columns.Add(columnName, typeof(int));
    bulkCopy.ColumnMappings.Add(columnName, columnName);

    foreach (var id in ids)
    {
        table.Rows.Add(id);
    }

    bulkCopy.WriteToServer(table);
}

//or do other things with your table instead of deleting here
conn.Execute(string.Format(@"DELETE FROM myTable where Id IN
    (SELECT {0} FROM #{0}s)", columnName));

conn.Execute(string.Format("DROP TABLE #{0}s", columnName));
To get this code working, I went dark side.
Since Dapper makes my list into parameters, and SQL Server can't handle a lot of parameters (I have never needed even double-digit parameter counts before), I had to go with dynamic SQL.
So here was my solution:
string listOfIdsJoined = "("+String.Join(",", listOfIds.ToArray())+")";
connection.Execute("delete from myTable where Id in " + listOfIdsJoined);
Before everyone grabs their torches and pitchforks, let me explain.
This code runs on a server whose only input is a data feed from a Mainframe system.
The list I am dynamically creating is a list of longs/bigints.
The longs/bigints are from an Identity column.
I know constructing dynamic SQL is bad juju, but in this case, I just can't see how it leads to a security risk.
Dapper expects a list of objects with the parameter as a property, so in the above case a list of objects with an Id property will work.
connection.Execute("delete from myTable where Id in (@Id)", listOfIds.AsEnumerable().Select(i=> new { Id = i }).ToList());
This will work.
