How to use Apache Flink with lookup data?

Hi,
I'm using Apache Flink 1.8. I have a stream of records coming in from Kafka as JSON; I'm filtering them, and that all works fine.
Now I would like to enrich the data from Kafka with a lookup value from a database table.
Is that just a case of creating two streams, loading the table in the second stream, and then joining the data?
The database table does get updated, but not frequently, and I would like to avoid looking up the DB on every record that comes through the stream.

Flink has state, which you could take advantage of here. I've done something similar, where I took a daily query from my lookup table (in my case it was a bulk webservice call) and threw the results into a Kafka topic. This Kafka topic was consumed by the same Flink job that needed the data for lookups. Both topics were keyed by the same value, but I used the lookup topic to store data into keyed state, and when processing the other topic, I'd pull the data back out of state.
I had some additional logic to check if there was NO state yet for a given key. If that was the case, I'd make an async request to the webservice. You may not need to do that, however.
The caveat here is that I had enough memory for state management, and my lookup table was only about 30 million records, about 100 GB spread across 45 slots on 15 nodes.
[In answer to question in comments]
Sorry, my answer was too long for a comment, so I had to edit my post:
I had a Python job that loaded the data via a bulk REST call (yours could just do a database lookup). It then transformed the data into the correct format and dumped it into Kafka. My Flink job then had two sources: one was the 'real data' topic, the other was the 'lookup data' topic. Data coming from the lookup data topic was stored in state (I used a ValueState because each key mapped to a single possible value, but there are other state types). I also had a 24-hour expiration time for each entry, but that was specific to my use case.
The trick is that the operation that stores the value in state from the lookup topic has to be the same operation that pulls the value back out of state for the 'real' topic. This is because Flink state (even keyed state) is tied to the operator that created it.
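For reference, here is a minimal sketch of that pattern in the DataStream API. The Tuple2/Tuple3 record shapes, the state name, and the wiring at the end are assumptions for illustration, not the original job:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// First input: (key, payload) from the 'real data' topic.
// Second input: (key, lookupValue) from the 'lookup data' topic.
// Output: (key, payload, lookupValue).
public class LookupEnrichmentFunction extends
        CoProcessFunction<Tuple2<String, String>, Tuple2<String, String>, Tuple3<String, String, String>> {

    private transient ValueState<String> lookupState;

    @Override
    public void open(Configuration parameters) {
        lookupState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lookup", String.class));
    }

    // 'Real data' records: pull the lookup value back out of keyed state.
    @Override
    public void processElement1(Tuple2<String, String> record, Context ctx,
                                Collector<Tuple3<String, String, String>> out) throws Exception {
        String lookupValue = lookupState.value();
        if (lookupValue != null) {
            out.collect(Tuple3.of(record.f0, record.f1, lookupValue));
        }
        // else: optionally buffer the record or fall back to an async lookup
    }

    // 'Lookup data' records: store the latest value in keyed state.
    @Override
    public void processElement2(Tuple2<String, String> lookup, Context ctx,
                                Collector<Tuple3<String, String, String>> out) throws Exception {
        lookupState.update(lookup.f1);
    }
}

Both streams are keyed by the same field and connected, so the state written from the lookup topic is visible when the real data is processed, e.g. realData.keyBy(0).connect(lookupData.keyBy(0)).process(new LookupEnrichmentFunction()).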

Related

Do Snowflake streams have any limitation on the amount of data they can process?

I am considering using streams and tasks for transforming and transferring data from the CDC table created by the Kafka Snowflake connector (in JSON format) into fully structured Snowflake tables.
I am wondering whether there is any limitation on the amount of data the streams can process. I am talking about processing millions of records per day.
Has someone already tested the streams on Big Data?
Thanks in advance.
The stream is just an offset pointer. It doesn't store a copy of the data. As long as the data is in the base table, it will be in the stream, and have the same performance characteristics as the base table. There are two additional change tracking columns on the base table, so it would be similar to a 'where' condition on those columns.

Flink: Write retract stream to Kafka sink

Can anyone share a working example of writing a retract stream to a Kafka sink?
I tried the code below, which is not working.
DataStream<Tuple2<Boolean, User>> resultStream =
        tEnv.toRetractStream(result, User.class);

resultStream.addSink(new FlinkKafkaProducer(OutputTopic,
        new ObjSerializationSchema(OutputTopic),
        props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE));
Generally, the simplest solution would be to do something like:
resultStream.map(elem -> elem.f1)
This will allow you to write the User objects to Kafka.
But it isn't really that simple from a business point of view, or at least it depends on the use case. Kafka is an append-only log, while a retract stream represents ADD, UPDATE and DELETE operations. So, while the solution above will let you write the data to Kafka, the contents of the topic may not correctly represent the actual computation results, because they won't reflect the update and delete operations.
To write the actual, correct computation results to Kafka, you can try one of the following:
If you know that your use case will never produce any DELETE or UPDATE operations, then you can safely use the solution above.
If updates and deletes can only occur within some bounded interval (for example, a record can only be updated or deleted up to one hour after it is produced), you may want to use windows to aggregate all updates and write one final result to Kafka.
Finally, you can extend the User class with a field that marks whether the record is a retraction and keep that information when writing to the Kafka topic, as in the sketch below. This means you would have to handle all possible UPDATE or DELETE operations downstream (in the consumer of this data).
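To illustrate that last option, here is a rough sketch. The ChangeEnvelope wrapper and ChangeEnvelopeSerializationSchema are hypothetical (the latter analogous to the ObjSerializationSchema in the question), and resultStream, OutputTopic and props come from the original snippet:

// Wrapper that keeps the add/retract flag next to the payload so the consumer
// can apply updates and deletes itself.
public class ChangeEnvelope {
    public boolean isAdd;   // true for ADD/UPDATE, false for DELETE (retraction)
    public User payload;

    public ChangeEnvelope() {}

    public ChangeEnvelope(boolean isAdd, User payload) {
        this.isAdd = isAdd;
        this.payload = payload;
    }
}

DataStream<ChangeEnvelope> changeStream = resultStream
        .map(t -> new ChangeEnvelope(t.f0, t.f1))
        .returns(ChangeEnvelope.class);

changeStream.addSink(new FlinkKafkaProducer<>(
        OutputTopic,
        new ChangeEnvelopeSerializationSchema(OutputTopic),   // hypothetical schema
        props,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE));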
The easiest solution would be to use the upsert-kafka connector as a table sink. It is designed to consume a retract stream and write it to Kafka.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html
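For example, a minimal sketch of this approach, assuming Flink 1.12+, a hypothetical topic and schema, and that id is the unique key of the result table:

tEnv.executeSql(
    "CREATE TABLE users_sink (" +
    "  id BIGINT," +
    "  name STRING," +
    "  PRIMARY KEY (id) NOT ENFORCED" +
    ") WITH (" +
    "  'connector' = 'upsert-kafka'," +
    "  'topic' = 'users-output'," +
    "  'properties.bootstrap.servers' = 'localhost:9092'," +
    "  'key.format' = 'json'," +
    "  'value.format' = 'json'" +
    ")");

// Updates on the result table become upserts in Kafka, deletes become tombstone records.
result.executeInsert("users_sink");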

Cassandra - Handling partition and bucket for large data size

We have a requirement where an application reads a file and inserts data into a Cassandra database; however, the table can grow by 300+ MB in one shot during the day.
The table will have the structure below:
create table if not exists orders (
    id uuid,
    record text,
    status varchar,
    create_date timestamp,
    modified_date timestamp,
    primary key (status, create_date));
The 'status' column can have the values [Started, Completed, Done].
As per a couple of documents on the internet, read performance is best if a partition is < 100 MB, and an index should be used on a column that is modified least (so I cannot use the 'status' column as an index). Also, if I use buckets with TWCS at the minute level, there will be lots of buckets, which may impact performance.
So, how can I better make use of partitions and/or buckets to insert evenly across partitions and read records with the appropriate status?
Thank you in advance.
From the discussion in the comments it looks like you are trying to use Cassandra as a queue and that is a big anti-pattern.
While you could store data about the operations you've done in Cassandra, you should look for something like Kafka or RabbitMQ for the queuing.
It could look something like this:
Application 1 copies/generates record A;
Application 1 adds the path of A to a queue;
Application 1 upserts to Cassandra in a partition based on the file id/path (the other columns can be info such as date, time to copy, file hash, etc.);
Application 2 reads the queue, finds A, processes it and determines whether it failed or completed;
Application 2 upserts information about the processing, including the status, to Cassandra (a rough sketch follows below). You can also record things like the reason for a failure;
If it is a failure then you can write the path/id to another topic.
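A rough sketch of the 'Application 2' part of this flow, assuming Kafka for the queue and the DataStax Java driver for the upsert; the topic, table and column names are placeholders, not something from your schema:

import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.Properties;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FileProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "file-processor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             CqlSession session = CqlSession.builder().build()) {

            consumer.subscribe(Collections.singletonList("files-to-process"));

            // Hypothetical table partitioned by file path/id rather than by status.
            PreparedStatement upsert = session.prepare(
                    "INSERT INTO files_by_id (file_id, status, modified_date) VALUES (?, ?, ?)");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    String filePath = record.value();
                    String status = process(filePath) ? "Completed" : "Failed";
                    session.execute(upsert.bind(filePath, status, Instant.now()));
                }
            }
        }
    }

    private static boolean process(String filePath) {
        // placeholder for the real processing logic
        return true;
    }
}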
So to sum it up: don't try to use Cassandra as a queue; that is a widely accepted anti-pattern. You can and should use Cassandra to persist a log of what you have done, including perhaps the results of the processing (if applicable), how files were processed, their result and so on.
Depending on how you need to further read and use the data in Cassandra, you could think about using partitions and buckets based on things like the source of the file, the type of file, etc. If not, you could keep it partitioned by a unique value like the UUID I've seen in your table, and then look up information about a given file based on that.
Hope this helped,
Cheers!

Flink - how to use state as a cache

I want to read history from state. If the state is null, then read HBase, update the state, and use onTimer to set a state TTL. The problem is how to batch the reads from HBase, because reading single records from HBase is not efficient.
In general, if you want to cache/mirror state from an external database in Flink, the most performant approach is to stream the database mutations into Flink -- in other words, turn Flink into a replication endpoint for the database's change data capture (CDC) stream, if the database supports that.
I have no experience with hbase, but https://github.com/mravi/hbase-connect-kafka is an example of something that might work (by putting kafka in-between hbase and flink).
If you would rather query hbase from Flink, and want to avoid making point queries for one user at a time, then you could build something like this:
                 -> queryManyUsers -> keyBy(uId) ->
streamToEnrich                                        CoProcessFunction
                 -> keyBy(uID) --------------------->
Here you would split your stream, sending one copy through something like a window or process function or async i/o to query hbase in batches, and send the results into a CoProcessFunction that holds the cache and does the enrichment.
When records arrive in this CoProcessFunction directly, along the bottom path, if the necessary data is in the cache, then it is used. Otherwise the record is buffered, pending the arrival of data for the cache from the upper path.
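A rough sketch of such a CoProcessFunction, focusing on the buffering; CacheEntry, Event and EnrichedEvent are hypothetical placeholder types, not something defined in the question:

public class CachedEnrichmentFunction
        extends CoProcessFunction<CacheEntry, Event, EnrichedEvent> {

    private transient ValueState<CacheEntry> cache;     // per-key cache of HBase data
    private transient ListState<Event> pending;         // records waiting for the cache

    @Override
    public void open(Configuration parameters) {
        cache = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cache", CacheEntry.class));
        pending = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending", Event.class));
    }

    // Upper path: batched query results destined for the cache.
    @Override
    public void processElement1(CacheEntry entry, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        cache.update(entry);
        for (Event buffered : pending.get()) {      // flush anything that was waiting
            out.collect(new EnrichedEvent(buffered, entry));
        }
        pending.clear();
    }

    // Bottom path: the records to enrich.
    @Override
    public void processElement2(Event event, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        CacheEntry entry = cache.value();
        if (entry != null) {
            out.collect(new EnrichedEvent(event, entry));
        } else {
            pending.add(event);                     // wait for data from the upper path
        }
    }
}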

Idempotent queries of time series data in Camel

I often use Camel's idempotent pattern to prevent duplicate processing of discrete messages. What's the best practice to do this when the data stream in question is a large volume of messages each with a timestamp?
Consider this route configuration (pseudocode):
timer -> idempotent( search_splunk_as_batch -> split -> sql(insert))
We want to periodically query from splunk and write to sql. We don't want to miss any messages and we don't want any duplicate messages.
Instead of persisting an idempotent marker for each message, I'd like to note the cutoff time for each batch and begin the next query at the cutoff time.
Your method will probably work as long as you can rely on some assumptions:
Your indexers never load data that appears in the past (according to the _time field)
Your Camel route never runs in more than one process at a time writing to the same database table.
If you can make sure these are met, then you can just store the maximum timestamp that you receive from the search and use it with the "earliest" parameter of the Splunk search command. Storing and retrieving the max timestamp could be done with something like a file, a separate database table, or a column in your target table; a rough file-based sketch follows below.
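A minimal file-based sketch, purely illustrative: the class name and the way you obtain the batch's timestamps are placeholders, and you would only write the new cutoff after the SQL inserts for the batch have succeeded.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

// Read the last cutoff before building the Splunk query; write the max _time of the
// batch back once the batch has been persisted.
public class CutoffStore {

    private final Path cutoffFile;

    public CutoffStore(String fileName) {
        this.cutoffFile = Paths.get(fileName);
    }

    /** Lower bound ("earliest") for the next Splunk search. */
    public Instant readCutoff() throws IOException {
        if (!Files.exists(cutoffFile)) {
            return Instant.EPOCH;
        }
        return Instant.parse(new String(Files.readAllBytes(cutoffFile)).trim());
    }

    /** Persist the newest _time seen in this batch as the next cutoff. */
    public void writeCutoff(List<Instant> batchTimestamps, Instant previousCutoff) throws IOException {
        Instant maxSeen = batchTimestamps.stream()
                .max(Comparator.naturalOrder())
                .orElse(previousCutoff);
        Files.write(cutoffFile, maxSeen.toString().getBytes());
    }
}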
