I want to read history from state; if the state is null, then read HBase, update the state, and use onTimer to set a state TTL. The problem is how to batch the HBase reads, because reading single records from HBase is not efficient.
In general, if you want to cache/mirror state from an external database in Flink, the most performant approach is to stream the database mutations into Flink -- in other words, turn Flink into a replication endpoint for the database's change data capture (CDC) stream, if the database supports that.
I have no experience with hbase, but https://github.com/mravi/hbase-connect-kafka is an example of something that might work (by putting kafka in-between hbase and flink).
If you would rather query hbase from Flink, and want to avoid making point queries for one user at a time, then you could build something like this:
                 -> queryManyUsers -> keyBy(uId) ->
streamToEnrich                                       CoProcessFunction
                 -> keyBy(uId) -------------------->
Here you would split your stream, sending one copy through something like a window or process function or async i/o to query hbase in batches, and send the results into a CoProcessFunction that holds the cache and does the enrichment.
When records arrive in this CoProcessFunction directly, along the bottom path, if the necessary data is in the cache, then it is used. Otherwise the record is buffered, pending the arrival of data for the cache from the upper path.
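To make that lower-path handling concrete, here is a minimal sketch of such a CoProcessFunction, assuming placeholder types HBaseRow, Event, and EnrichedEvent and using keyed ValueState/ListState (none of these names come from the original question):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction extends CoProcessFunction<HBaseRow, Event, EnrichedEvent> {

    private transient ValueState<HBaseRow> cache;   // per-key cache entry, filled by the upper (HBase) path
    private transient ListState<Event> pending;     // events that arrived before their cache entry

    @Override
    public void open(Configuration parameters) {
        cache = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cache", HBaseRow.class));
        pending = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending", Event.class));
    }

    @Override
    public void processElement1(HBaseRow row, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        // cache data arrived: store it and flush anything that was waiting for it
        cache.update(row);
        for (Event e : pending.get()) {
            out.collect(new EnrichedEvent(e, row));
        }
        pending.clear();
    }

    @Override
    public void processElement2(Event e, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        HBaseRow row = cache.value();
        if (row != null) {
            out.collect(new EnrichedEvent(e, row));   // cache hit: enrich immediately
        } else {
            pending.add(e);                           // cache miss: buffer until the upper path delivers
        }
    }
}

It would be wired up roughly as in the diagram, e.g. queryResults.keyBy(r -> r.uId).connect(streamToEnrich.keyBy(e -> e.uId)).process(new EnrichmentFunction()).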
Can anyone share a working example of writing a retract stream to a Kafka sink?
I tried the following, which is not working.
DataStream<Tuple2<Boolean, User>> resultStream =
        tEnv.toRetractStream(result, User.class);

resultStream.addSink(new FlinkKafkaProducer<>(OutputTopic,
        new ObjSerializationSchema(OutputTopic),
        props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE));
Generally the simplest solution would be to do something like:
resultStream.map(elem -> elem.f1)
This will allow You to write the User objects to Kafka.
But this isn't really that simple from a business point of view, or at least it depends on the use case. Kafka is an append-only log and a retract stream represents ADD, UPDATE and DELETE operations. So, while the solution above will allow You to write the data to Kafka, the results in Kafka may not correctly represent the actual computation results, because they won't capture the update and delete operations.
To be able to write the actual, correct computation results to Kafka You may try one of the following things:
If You know that Your use-case will never produce any DELETE or UPDATE operations, then You can safely use the solution above.
If updates and deletes can only happen within some regular interval (for example, a record may only be updated or deleted up to 1 hr after it is produced), You may want to use windows to aggregate all updates and write one final result to Kafka.
Finally, You can extend the User class to add a field which marks whether this record is a retract operation, and keep that information when writing data to the Kafka topic, as sketched below. This means that You would have to handle all possible UPDATE or DELETE operations downstream (in the consumer of this data).
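For that last option, a minimal sketch of carrying the flag along (the UserChange wrapper and its fields are illustrative, not an existing class):

// Wrapper that keeps the retract flag next to the payload.
public class UserChange {
    public boolean isAdd;   // true = ADD/UPDATE, false = retraction
    public User user;

    public UserChange() {}  // no-arg constructor so Flink treats this as a POJO

    public UserChange(boolean isAdd, User user) {
        this.isAdd = isAdd;
        this.user = user;
    }
}

// in the job:
DataStream<UserChange> changes = resultStream
        .map(t -> new UserChange(t.f0, t.f1))
        .returns(UserChange.class);

// serialize UserChange (e.g. as JSON) in the serialization schema and let the
// consumer of the topic interpret the flag.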
The easiest solution would be to use the upsert-kafka connector, as a table sink. This is designed to consume a retract stream and write it to kafka.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html
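A hedged sketch of what that could look like (table name, columns, key, and broker address are placeholders, not taken from the question):

tEnv.executeSql(
    "CREATE TABLE users_sink (" +
    "  user_id STRING," +
    "  name    STRING," +
    "  PRIMARY KEY (user_id) NOT ENFORCED" +
    ") WITH (" +
    "  'connector' = 'upsert-kafka'," +
    "  'topic' = 'users'," +
    "  'properties.bootstrap.servers' = 'localhost:9092'," +
    "  'key.format' = 'json'," +
    "  'value.format' = 'json'" +
    ")");

// write the updating table into it; adds/updates become upserts,
// deletes become tombstone (null-value) records in Kafka
result.executeInsert("users_sink");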
My Flink job currently does a keyBy on client id and then uses a window operator to accumulate data for 1 minute, after which the data is aggregated. After aggregation we sink the accumulated data into HDFS files. The number of unique keys (client ids) is more than 70 million daily.
The issue is that keyBy distributes the data across the cluster (my assumption), but I want the data for incoming events to be aggregated for 1 minute on the same slot (or node).
NOTE: In the sink we can have multiple records for the same client within a 1 minute window. I want to save network calls.
You're right that doing a stream.keyBy() will cause network traffic when the data is partitioned/distributed (assuming you have parallelism > 1, of course). But the standard window operators require a keyed stream.
You could create a ProcessFunction that implements the CheckpointedFunction interface, and use that to maintain state in an unkeyed stream. But you'd still have to implement your own timers (standard Flink timers require a keyed stream), and save the time windows as part of the state.
You could write your own custom RichFlatMapFunction, and have an in-memory Map<time window, Map<client id, count>> to do pre-keyed aggregations. You'd still need to follow this with a keyBy() and window operation to do the final aggregation, but there would be much less network traffic.
I think it's OK that this pre-aggregation isn't held in checkpointed Flink state. Though you'd likely need to make this an LRU cache, to avoid blowing memory. And you'd need to create your own timer to flush the windows.
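A rough sketch of that pre-aggregating operator, assuming an Event type with clientId and timestamp fields (the flush-on-size check stands in for the LRU/timer handling mentioned above):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits partial (minute, clientId, count) records; a keyBy + window downstream
// still does the final aggregation, but on far fewer records.
public class PreAggregator extends RichFlatMapFunction<Event, Tuple3<Long, String, Long>> {

    private static final int MAX_ENTRIES = 100_000;  // crude bound instead of a real LRU

    private transient Map<Long, Map<String, Long>> counts;
    private transient int entries;

    @Override
    public void open(Configuration parameters) {
        counts = new HashMap<>();
        entries = 0;
    }

    @Override
    public void flatMap(Event e, Collector<Tuple3<Long, String, Long>> out) {
        long minute = e.timestamp / 60_000;
        Map<String, Long> perClient = counts.computeIfAbsent(minute, m -> new HashMap<>());
        if (perClient.merge(e.clientId, 1L, Long::sum) == 1L) {
            entries++;  // a new (minute, clientId) entry appeared
        }
        if (entries >= MAX_ENTRIES) {
            flush(out);
        }
    }

    private void flush(Collector<Tuple3<Long, String, Long>> out) {
        counts.forEach((minute, perClient) ->
                perClient.forEach((clientId, count) ->
                        out.collect(Tuple3.of(minute, clientId, count))));
        counts.clear();
        entries = 0;
    }
}

Because the map lives only in operator memory, anything buffered at the time of a failure is lost, which is the trade-off accepted above.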
But the golden rule is to measure first, then optimize. As in, confirm that network traffic really is a problem before performing helicopter stunts to try to reduce it.
Hi,
I'm using Apache Flink 1.8. I have a stream of records coming in from Kafka as JSON; I'm filtering them, and that all works fine.
Now, I would like to enrich the data from Kafka with a look up value from a database table.
Is that just a case of creating 2 streams, loading the table in the 2nd stream and then joining the data?
The database table does get updated but not frequently and I would like to avoid looking up the DB on every record that comes through the stream.
Flink has state, which you could take advantage of here. I've done something similar, where I took a daily query from my lookup table (in my case it was a bulk webservice call) and threw the results into a kafka topic. This kafka topic was being consumed by the same flink job that needed the data for lookups. Both topics were keyed by the same value, but I used the lookup topic to store data into a keyed state, and when processing the other topic, I'd pull the data back out of state.
I had some additional logic to check if there was NO state yet for a given key. If that was the case, I'd make an async request to the webservice. You may not need to do that however.
The caveat here is that I had the memory for state management, and my lookup table was only about 30 million records, about 100 gigs spread across 45 slots on 15 nodes.
[In answer to question in comments]
Sorry, but my answer was too long, so had to edit my post:
I had a python job that loaded the data via a bulk REST call (yours could just do a database lookup). It then transformed the data into the correct format and dumped it into Kafka. Then my flink flow had two sources: one was the 'real data' topic, the other was the 'lookup data' topic. Data coming from the lookup data topic was stored in state (I used a ValueState because each key mapped to a single possible value, but there are other state types). I also had a 24 hour expiration time for each entry, but that was my use case.
The trick is that the same operation that stores the value in state from the lookup topic has to be the operation that pulls the value back out of state for the 'real' topic. This is because flink state (even keyed state) is tied to the operator that created it.
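A compact sketch of that shared operator, with placeholder types LookupRecord, Event, and EnrichedEvent (the 24 hour expiration is expressed here with Flink's state TTL, which is one way to get that behaviour):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Both topics are keyed by the same field before .connect(); the lookup topic
// writes into keyed state, and the 'real' topic reads it back out of the same operator.
public class LookupEnricher extends CoProcessFunction<LookupRecord, Event, EnrichedEvent> {

    private transient ValueState<LookupRecord> lookup;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<LookupRecord> desc =
                new ValueStateDescriptor<>("lookup", LookupRecord.class);
        desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(24)).build());  // 24h expiry
        lookup = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement1(LookupRecord r, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        lookup.update(r);                 // lookup topic: store
    }

    @Override
    public void processElement2(Event e, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        LookupRecord r = lookup.value();  // real topic: read back out
        if (r != null) {
            out.collect(new EnrichedEvent(e, r));
        }
        // else: this is where the async fallback to the webservice could go
    }
}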
I have to process a streaming log, such as
{"id":1, "name":"alice"}
Each log record needs to look up the family address in a mapping DB; however, the data in the DB is changing.
Can I read the DB periodically, to avoid an IO operation for every single log record?
It seems you could solve your problem with a custom RichMapFunction where you implement RichMapFunction#open to get the state from the database (and store it in some data structure) before starting to process events.
You can then launch an auxiliary thread from that function which, from time to time, fetches the most up-to-date information from the database and update the data structure. This doesn't need any locking if you can fit the dataset in memory twice as you can just perform an atomic swap between the two data structures.
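A sketch of that pattern, assuming placeholder LogEvent/EnrichedLog types and a loadAddressMap() helper standing in for the actual DB query:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Loads the mapping once in open(), then refreshes it periodically on a background
// thread; swapping the volatile reference keeps the hot path lock-free.
public class AddressEnricher extends RichMapFunction<LogEvent, EnrichedLog> {

    private transient volatile Map<Integer, String> addresses;
    private transient ScheduledExecutorService refresher;

    @Override
    public void open(Configuration parameters) {
        addresses = loadAddressMap();                        // initial snapshot
        refresher = Executors.newSingleThreadScheduledExecutor();
        refresher.scheduleAtFixedRate(
                () -> addresses = loadAddressMap(),          // atomic reference swap
                10, 10, TimeUnit.MINUTES);
    }

    @Override
    public EnrichedLog map(LogEvent event) {
        return new EnrichedLog(event, addresses.get(event.id));  // pure in-memory lookup
    }

    @Override
    public void close() {
        if (refresher != null) {
            refresher.shutdownNow();
        }
    }

    private Map<Integer, String> loadAddressMap() {
        // run the query against the mapping DB here (e.g. via JDBC) and build a fresh map
        return new HashMap<>();
    }
}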
I often use Camel's idempotent pattern to prevent duplicate processing of discrete messages. What's the best practice to do this when the data stream in question is a large volume of messages each with a timestamp?
Consider this route configuration (pseudocode):
timer -> idempotent( search_splunk_as_batch -> split -> sql(insert))
We want to periodically query from splunk and write to sql. We don't want to miss any messages and we don't want any duplicate messages.
Instead of persisting an idempotent marker for each message, I'd like to note the cutoff time for each batch and begin the next query at the cutoff time.
Your method will probably work as long as you can rely on some assumptions:
Your indexers never load data that appears in the past (according to the _time field)
Your camel route is never running in more than one process at a time that is sending to the same database table.
If you can make sure these are met, then you can just store the maximum timestamp that you receive from the search and use that with the "earliest" parameter of the splunk search command. Storing and retrieving the max timestamp could be done with something like a file, a separate database table, or using a column in your target table.
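As a rough sketch of how that could hang together in a route (CutoffStore and SplunkSearchBean are hypothetical helpers here: the first loads and persists the cutoff from wherever you keep it, the second runs the Splunk search with earliest=<cutoff> and returns a list of result maps; the sql endpoint assumes a configured DataSource):

import org.apache.camel.builder.RouteBuilder;

public class SplunkToSqlRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("timer:splunkPoll?period=300000")               // poll every 5 minutes
            .bean(CutoffStore.class, "loadCutoff")           // body = last stored cutoff timestamp
            .bean(SplunkSearchBean.class, "searchSince")     // body = events with _time >= cutoff
            .split(body())
                .to("sql:insert into events (id, event_time, payload) "
                    + "values (:#id, :#event_time, :#payload)")
            .end()
            // the splitter hands back the original list here, so we can take its max _time
            .bean(CutoffStore.class, "saveMaxTimestamp");
    }
}

Because the cutoff is only advanced after the whole batch has been written, a crash mid-batch re-processes that batch rather than losing it, so the insert should tolerate replays (e.g. via a primary key).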