I'm trying to code a streaming job which sinks a data stream into a Postgres table. To give full information, I based my work on this article: https://tech.signavio.com/2017/postgres-flink-sink, which proposes using JDBCOutputFormat.
My code looks like the following:
...

String strQuery = "INSERT INTO public.alarm (entity, duration, first, type, windowsize) VALUES (?, ?, ?, 'dur', 6)";

JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername("org.postgresql.Driver")
        .setDBUrl("jdbc:postgresql://localhost:5432/postgres?user=michel&password=polnareff")
        .setQuery(strQuery)
        .setSqlTypes(new int[] { Types.VARCHAR, Types.INTEGER, Types.VARCHAR }) // set the types
        .finish();

DataStream<Row> rows = FilterStream
        .map((tuple) -> {
            Row row = new Row(3); // our prepared statement has 3 parameters
            row.setField(0, tuple.f0); // first parameter is case ID
            row.setField(1, tuple.f1); // second parameter is tracehash
            row.setField(2, f.format(tuple.f2)); // third parameter is the formatted f2 value
            return row;
        });

rows.writeUsingOutputFormat(jdbcOutput);

env.execute();

    }
}
My problem now is that values are inserted only when my job is stopped (to be precise, when I cancel the job from the Apache Flink dashboard).
So my question is the following: did I miss something? Should I commit the inserted rows somewhere?
Best regards,
Ignatius
As Chesnay said in his comment, you have to adapt the batch interval.
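For reference, the JDBCOutputFormat builder exposes a setBatchInterval option. The following is only a sketch of that adjustment: by default the format buffers a fairly large batch and otherwise writes on close(), which would explain why rows only appear when the job is cancelled.

JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername("org.postgresql.Driver")
        .setDBUrl("jdbc:postgresql://localhost:5432/postgres?user=michel&password=polnareff")
        .setQuery(strQuery)
        .setSqlTypes(new int[] { Types.VARCHAR, Types.INTEGER, Types.VARCHAR })
        .setBatchInterval(1) // write every record instead of waiting for the default batch size
        .finish();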
However, this is not the full story. If you want to achieve at-least-once results, you have to sync the batch writes with Flink's checkpoints. Basically, you have to wrap the JDBCOutputFormat in a SinkFunction that also implements the CheckpointedFunction interface. When snapshotState() is called, you have to write the batch to the database. You can have a look at this pull request, which will provide this functionality in the next release.
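To illustrate what such a wrapper could look like, here is a rough sketch (not the pull request's code). It talks to JDBC directly rather than through JDBCOutputFormat, since the 1.3 output format does not seem to expose a public flush method; the class name, field names, and the column types are assumptions based on the query above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.types.Row;

public class CheckpointedJdbcSink extends RichSinkFunction<Row> implements CheckpointedFunction {

    private final String dbUrl;
    private final String insertQuery;

    private transient Connection connection;
    private transient PreparedStatement statement;

    public CheckpointedJdbcSink(String dbUrl, String insertQuery) {
        this.dbUrl = dbUrl;
        this.insertQuery = insertQuery;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(dbUrl);
        statement = connection.prepareStatement(insertQuery);
    }

    @Override
    public void invoke(Row row) throws Exception {
        // buffer only; nothing hits the database between checkpoints
        statement.setString(1, (String) row.getField(0));
        statement.setInt(2, (Integer) row.getField(1));
        statement.setString(3, (String) row.getField(2));
        statement.addBatch();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // write the buffered batch on every checkpoint
        statement.executeBatch();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // nothing to restore here: records since the last completed checkpoint are replayed by Flink
    }

    @Override
    public void close() throws Exception {
        statement.executeBatch();
        statement.close();
        connection.close();
    }
}

The sink would be attached with rows.addSink(new CheckpointedJdbcSink(dbUrl, strQuery)), and checkpointing has to be enabled (env.enableCheckpointing(...)), otherwise snapshotState() is never called.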
Fabian's answer is one way to achieve at-least-once semantics: by syncing the writes with Flink's checkpoints. However, this has the disadvantage that your sink's data freshness is now tied to your checkpointing interval.
As an alternative, you could store your tuples or rows that have the (entity, duration, first) fields in Flink's own managed state, so that Flink takes care of checkpointing them (in other words, make your sink's state fault-tolerant). To do that, you implement the CheckpointedFunction and CheckpointedRestoring interfaces (without having to sync your writes with checkpoints; you can even execute your SQL inserts individually if you do not have to use JDBCOutputFormat). See: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/state.html#using-managed-operator-state. Another solution is to implement only the ListCheckpointed interface (it can be used in a similar way as the deprecated CheckpointedRestoring interface, and it supports list-style state redistribution).
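Here is a sketch of that alternative, using ListCheckpointed to keep the not-yet-written rows in managed operator state (again, the names and column types are assumptions; inserts are executed individually, independent of the checkpoint cycle).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.types.Row;

public class BufferingJdbcSink extends RichSinkFunction<Row> implements ListCheckpointed<Row> {

    private final String dbUrl;
    private final String insertQuery;

    private transient Connection connection;
    private transient PreparedStatement statement;

    // rows accepted by the sink but not yet written to Postgres
    private final List<Row> pending = new ArrayList<>();

    public BufferingJdbcSink(String dbUrl, String insertQuery) {
        this.dbUrl = dbUrl;
        this.insertQuery = insertQuery;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(dbUrl);
        statement = connection.prepareStatement(insertQuery);
        flushPending(); // retry anything that was restored after a failure
    }

    @Override
    public void invoke(Row row) throws Exception {
        pending.add(row);
        flushPending(); // inserts are executed individually, independent of checkpoints
    }

    private void flushPending() throws Exception {
        for (Row row : new ArrayList<>(pending)) {
            statement.setString(1, (String) row.getField(0));
            statement.setInt(2, (Integer) row.getField(1));
            statement.setString(3, (String) row.getField(2));
            statement.executeUpdate();
            pending.remove(row);
        }
    }

    @Override
    public List<Row> snapshotState(long checkpointId, long timestamp) {
        return new ArrayList<>(pending); // Flink checkpoints whatever has not been written yet
    }

    @Override
    public void restoreState(List<Row> state) {
        pending.addAll(state);
    }

    @Override
    public void close() throws Exception {
        statement.close();
        connection.close();
    }
}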
Related
I have a use case where I have 2 input topics in Kafka.
Topic schema:
eventName, ingestion_time (will be used as the watermark), orderType, orderCountry
Data for first topic:
{"eventName": "orderCreated", "userId":123, "ingestionTime": "1665042169543", "orderType":"ecommerce","orderCountry": "UK"}
Data for second topic:
{"eventName": "orderSucess", "userId":123, "ingestionTime": "1665042189543", "orderType":"ecommerce","orderCountry": "USA"}
I want to get all userIds per orderType and orderCountry where the user performs the first event but not the second one within a window of 5 minutes, for a maximum of 2 such windows per user per orderType and orderCountry (i.e. up to 10 minutes only).
I have unioned the data from both topics, created a view on top of it, and I am trying to use Flink CEP SQL to get my output, but somehow I am not able to figure it out.
SELECT *
FROM union_event_table
MATCH_RECOGNIZE (
    PARTITION BY orderType, orderCountry
    ORDER BY ingestion_time
    MEASURES
        A.userId AS userId,
        A.orderType AS orderType,
        A.orderCountry AS orderCountry
    ONE ROW PER MATCH
    PATTERN (A not followed B) WITHIN INTERVAL '5' MINUTES
    DEFINE
        A AS A.eventName = 'orderCreated',
        B AS B.eventName = 'orderSucess'
)
The first thing I cannot figure out is what to use in place of "A not followed B" in SQL. The other thing is how I can restrict the output for a userId to a maximum of 2 windows per orderType and orderCountry, i.e. if a user doesn't perform the 2nd event after the 1st event in 2 consecutive 5-minute windows, the state of that user should be removed, so that I do not get output for that user for the same orderType and orderCountry again.
I don't believe this is possible using MATCH_RECOGNIZE. This could, however, be implemented with the DataStream CEP library by using its capability to send timed out patterns to a side output.
This could also be solved at a lower level by using a KeyedProcessFunction. The long ride alerts exercise from the Apache Flink Training repo is an example of that -- you can jump straight away to the solution if you want.
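To give an idea of the KeyedProcessFunction approach, here is a sketch. OrderEvent is a made-up POJO for the JSON above, and the unioned stream is assumed to be keyed by userId + orderType + orderCountry with event-time timestamps and watermarks already assigned.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Made-up POJO matching the JSON fields shown above.
class OrderEvent {
    public String eventName;
    public long userId;
    public long ingestionTime;
    public String orderType;
    public String orderCountry;
}

// Emits the key (userId|orderType|orderCountry) when an 'orderCreated' event is not
// followed by 'orderSucess' within 5 minutes of event time.
public class MissingSuccessFunction extends KeyedProcessFunction<String, OrderEvent, String> {

    private transient ValueState<Long> deadline; // timer registered for the pending orderCreated

    @Override
    public void open(Configuration parameters) {
        deadline = getRuntimeContext().getState(new ValueStateDescriptor<>("deadline", Long.class));
    }

    @Override
    public void processElement(OrderEvent event, Context ctx, Collector<String> out) throws Exception {
        if ("orderCreated".equals(event.eventName) && deadline.value() == null) {
            long timeout = event.ingestionTime + 5 * 60 * 1000;
            ctx.timerService().registerEventTimeTimer(timeout);
            deadline.update(timeout);
        } else if ("orderSucess".equals(event.eventName) && deadline.value() != null) {
            // the success arrived in time: cancel the pending timer and clear the state
            ctx.timerService().deleteEventTimeTimer(deadline.value());
            deadline.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // 5 minutes of event time passed without a matching orderSucess
        out.collect(ctx.getCurrentKey());
        deadline.clear();
    }
}

The stream would be wired up with something like stream.keyBy(e -> e.userId + "|" + e.orderType + "|" + e.orderCountry).process(new MissingSuccessFunction()). The "maximum of 2 windows per user" requirement could be handled with an additional counter in keyed state that stops re-registering timers once it reaches 2.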
I am following the tutorial here.
Q1: Why in the final application do we clear all states and delete timer whenever flagState = true regardless of the current transaction amount? I refer to this part of the code:
// Check if the flag is set
if (lastTransactionWasSmall != null) {
    if (transaction.getAmount() > LARGE_AMOUNT) {
        // Output an alert downstream
        Alert alert = new Alert();
        alert.setId(transaction.getAccountId());
        collector.collect(alert);
    }
    // Clean up our state [WHY HERE?]
    cleanUp(context);
}
If the transaction amounts in the datastream were 0.5, 10, 600, then flagState would be set for 0.5 and then cleared for 10. So for 600, we skip the code block above and don't check for a large amount. But if the 0.5 and 600 transactions occurred within a minute, we should have sent an alert, yet we didn't.
Q2: Why do we use processing time to determine whether two transactions are 1 minute apart? The Transaction class has a timestamp field, so isn't it better to use event time? Processing time is affected by the speed of the application, so two transactions with event times within 1 minute of each other could be processed more than 1 minute apart due to lag.
A1: The fraud model used in this example is illustrated by a figure in the tutorial: a small transaction immediately followed by a large one.
In your example, the transaction for 600 must immediately follow the transaction for 0.5 to be considered fraud. Because of the intervening transaction for 10, it is not fraud, even if all three transactions occur within a minute. It's just a matter of how the use case was framed.
A2: Doing this with event time would be a very valid choice, but would make the example much more complex. Not only would watermarks be required, but we would also have to sort the stream by event time, since a realistic example would have to consider that the events might be out-of-order.
At that point, implementing this with a process function would no longer be the best choice. Using the temporal pattern matching capabilities of either Flink's CEP library or Flink SQL with MATCH_RECOGNIZE would be the way to go.
I have a use case where I get balances based on date, and I want to show the correct balance for each day. If I get an update for an older date, all balances of that account from that date onwards change.
For example:
Account   Date     Balance   Total balance
IBM       1 Jun    100       100
IBM       2 Jun    50        150
IBM       10 Jun   200       350
IBM       12 Jun   200       550
Now I get a message for 4 Jun (this is a scenario where some transaction is done back-dated, or some correction is made; it's a frequent scenario):
Account   Date     Balance   Total balance
IBM       1 Jun    100       100
IBM       2 Jun    50        150
IBM       4 Jun    300       450   <-- all data from this point changes
IBM       10 Jun   200       650
IBM       12 Jun   200       850
It's streaming data, and at any point I want the correct balance to be shown for each account.
I know Flink and Kafka are good for streaming use cases where an update for a particular date doesn't trigger updates on all data from that point onwards. But can we achieve this scenario efficiently as well, or is this NOT a use case for these streaming technologies at all?
Please help
You can't modify a past message in the queue, therefore you should introduce a new message that invalidates the previous one. For instance, you can use an ID for each transaction (and repeat it if you need to modify it). In case you have two or more messages with the same ID, you keep the last one.
Take a look at KTable from Kafka Streams. It can help you aggregate data using that ID (or any other aggregation factor) and generate, as a result, a table with the valid summary up to now. If a new message arrives, table updates will be emitted.
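A rough sketch of that idea with the Kafka Streams DSL (topic names, serdes, and the key format are assumptions): each balance update is keyed by a transaction ID, so a back-dated correction is just a new record for the same ID, and the adder/subtractor pair keeps the per-account total consistent when an old value is replaced.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class AccountTotals {

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Key: transactionId (assumed format "ACCOUNT-something", e.g. "IBM-4Jun-tx42"),
        // value: the (possibly corrected) amount. A back-dated correction is simply a new
        // record with the same transactionId, and the KTable keeps only the latest one.
        KTable<String, Long> amountsByTxn = builder.table(
                "balance-updates",
                Consumed.with(Serdes.String(), Serdes.Long()));

        // Re-key each transaction to its account and keep a running total per account.
        // The subtractor removes the old amount whenever a transactionId is overwritten,
        // so a correction adjusts the account total automatically and an update is emitted.
        KTable<String, Long> totalByAccount = amountsByTxn
                .groupBy(
                        (txnId, amount) -> KeyValue.pair(txnId.split("-")[0], amount),
                        Grouped.with(Serdes.String(), Serdes.Long()))
                .reduce(
                        (total, amount) -> total + amount,  // adder: new or updated value
                        (total, amount) -> total - amount); // subtractor: replaced value

        totalByAccount.toStream().to("account-totals", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}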
So I am currently performing a test to estimate how much work my Google App Engine application can do without going over quotas.
This is my test:
I have in the datastore an entity that, according to my local dashboard, needs 18 write operations. I have 5 entries of this type in a table.
Every 30 seconds, I fetch those 5 entities mentioned above. I DO NOT USE MEMCACHE FOR THESE!!!
That means 5 * 18 = 90 read operations per fetch, right?
In 1 minute that means 180, and in 1 hour that means 10,800 read operations... which is ~20% of the daily limit quota...
However, after 1 hour of my test running, I noticed on my online dashboard that only 2% of the read operations were used... and my question is: why is that? Where is the flaw in my calculations?
Also... where can I see in the online dashboard how many read/write operations an entity needs?
Thanks
A write on your entity may need 18 write operations, but a get on your entity will cost you only 1 read.
So if you get 5 entities every 30 seconds for one hour, you'll have about 5 reads * 120 = 600 reads.
This is the case if you do a get on your 5 entries (fetching each entry by its id).
If you make a query to fetch them, the cost is "1 read + 1 read per entity retrieved". That means 1 + 5 = 6 reads per fetch, so around 720 reads in one hour.
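To make the distinction concrete, a small example with the low-level Java Datastore API (the kind name and key are made up; the comments reflect the cost model described above):

import java.util.List;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;

public class ReadCostExample {

    public static void readEntities() throws EntityNotFoundException {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        // Get by key: 1 read operation, no matter how many indexed properties the
        // entity has (those only matter for the 18 write ops per write).
        Key key = KeyFactory.createKey("MyEntity", 42L);
        Entity single = datastore.get(key);

        // Query: 1 read for the query itself plus 1 read per entity retrieved,
        // so fetching 5 entities this way costs 6 read operations.
        List<Entity> five = datastore.prepare(new Query("MyEntity"))
                .asList(FetchOptions.Builder.withLimit(5));
    }
}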
For more detailed information, here is the documentation for estimating costs.
You can't see on the dashboard how many write/read operations an entity needs, but I invite you to check Appstats for that.
I'm looking for a way to order Google Book's Ngram's by frequency.
The original dataset is here: http://books.google.com/ngrams/datasets. Inside each file the ngrams are sorted alphabetically and then chronologically.
My computer is not powerful enough to handle 2.2 TB worth of data, so I think the only way to sort this would be "in the cloud".
The AWS-hosted version is here: http://aws.amazon.com/datasets/8172056142375670.
Is there a financially efficient way to find the 10,000 most frequent 1grams, 2grams, 3grams, 4grams, and 5grams?
To throw a wrench in it, the datasets contain data for multiple years:
As an example, here are the 30,000,000th and 30,000,001st lines from file 0 of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

circumvallate 1978 313 215 85
circumvallate 1979 183 147 77

The first line tells us that in 1978, the word "circumvallate" (which means "surround with a rampart or other fortification", in case you were wondering) occurred 313 times overall, on 215 distinct pages and in 85 distinct books from our sample.
Ideally, the frequency lists would only contain data from 1980-present (the sum of each year).
Any help would be appreciated!
Cheers,
I would recommend using Pig!
Pig makes things like this very easy and straightforward. Here's a sample Pig script that does pretty much what you need:
raw = LOAD '/foo/input' USING PigStorage('\t') AS (ngram:chararray, year:int, count:int, pages:int, books:int);
filtered = FILTER raw BY year >= 1980;
grouped = GROUP filtered BY ngram;
counts = FOREACH grouped GENERATE group AS ngram, SUM(filtered.count) AS count;
sorted = ORDER counts BY count DESC;
limited = LIMIT sorted 10000;
STORE limited INTO '/foo/output' USING PigStorage('\t');
Pig on AWS Elastic MapReduce can even operate directly on S3 data, so you would probably replace /foo/input and /foo/output with S3 buckets too.