Spark Streaming:
When there are code changes to my Spark Streaming app, I have to clean the checkpoint directory to deploy the new changes. Effectively I am losing the historical state, which is really bad.
Is there a way to save and rebuild the state information from an external database like Cassandra, apart from the regular checkpointing that Spark provides by default?
Can you please point me to some code examples in this regard?
If you are using a receiver-less approach like the Kafka direct API, you can get the offsets for the topics you read and store them in Cassandra (or any other database). On initialization, you read the offsets back from Cassandra and pass them to the direct API. This way you can avoid checkpointing and upgrade your jar easily.
To access the offsets in the DStream, examples can be seen here:
offset reading example
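For illustration, here is a minimal sketch of that pattern using the spark-streaming-kafka-0-10 direct API together with the DataStax Spark Cassandra connector. The keyspace and table (my_ks.kafka_offsets) and all topic, host, and group names are hypothetical placeholders; adapt the schema and the processing to your job.

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}

import scala.collection.JavaConverters._

object OffsetsInCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("offsets-in-cassandra")
      .set("spark.cassandra.connection.host", "cassandra-host") // placeholder
    val ssc = new StreamingContext(conf, Seconds(10))
    val connector = CassandraConnector(conf)

    val topics = Seq("events")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // On startup, read the last stored offsets from Cassandra instead of relying on a checkpoint.
    // Assumed table: my_ks.kafka_offsets(topic text, partition_id int, last_offset bigint,
    //                                    PRIMARY KEY (topic, partition_id))
    val savedOffsets: Map[TopicPartition, Long] = connector.withSessionDo { session =>
      session.execute("SELECT topic, partition_id, last_offset FROM my_ks.kafka_offsets").all().asScala
        .map(r => new TopicPartition(r.getString("topic"), r.getInt("partition_id")) -> r.getLong("last_offset"))
        .toMap
    }

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams, savedOffsets))

    stream.foreachRDD { rdd =>
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd here ...
      // After processing, persist the end offsets so a redeployed jar can resume from here.
      connector.withSessionDo { session =>
        ranges.foreach { r =>
          session.execute(
            "INSERT INTO my_ks.kafka_offsets (topic, partition_id, last_offset) VALUES (?, ?, ?)",
            r.topic, Int.box(r.partition), Long.box(r.untilOffset))
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The important detail is that the offsets are written only after the batch has been processed, so on a redeploy the job resumes from the last successfully processed batch (at-least-once semantics, unless your writes are idempotent or transactional).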
Related
I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring the transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transactions. But I'd like to separate the feature-computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit this, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is splitting this task into two jobs a bad idea with Flink? We do it this way in the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. That way, you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
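As a rough sketch (not the only way to wire this up), the feature job could look like the following. It assumes the universal Flink Kafka connector and plain comma-separated string records; on Flink 1.3 the connector classes are versioned instead (e.g. FlinkKafkaConsumer010), and the topic and field names are made up.

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object FeatureJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "kafka:9092")
    props.setProperty("group.id", "feature-job")

    // Raw transactions as "source,target,amount" strings (toy format for the sketch).
    val transactions: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer[String]("transactions", new SimpleStringSchema(), props))

    // Keep one historical feature in keyed state: the number of past transactions per source->target pair.
    val enriched: DataStream[String] = transactions
      .map { line => val Array(src, dst, amount) = line.split(","); (s"$src->$dst", amount) }
      .keyBy(_._1)
      .mapWithState[String, Long] { case ((pair, amount), countState) =>
        val count = countState.getOrElse(0L) + 1
        (s"$pair,$amount,$count", Some(count)) // transaction + feature, as a string for simplicity
      }

    // Publish enriched records; a separate scoring job consumes this topic and applies the model.
    enriched.addSink(new FlinkKafkaProducer[String]("enriched-transactions", new SimpleStringSchema(), props))
    env.execute("feature-job")
  }
}
```

The scoring job is then just another Flink (or Kafka Streams) consumer of the enriched-transactions topic, so the two jobs can be deployed and upgraded independently.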
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
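To make the Async I/O idea concrete, here is a hedged sketch of an AsyncFunction that calls out to an external scoring service. The service call itself is a stub (callModelService is a made-up placeholder); in practice you would use a non-blocking HTTP or gRPC client and handle errors and timeouts.

```scala
import java.util.concurrent.TimeUnit

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

// Calls a (hypothetical) external model-serving endpoint for every enriched record.
class ScoringAsyncFunction extends AsyncFunction[String, (String, Double)] {

  override def asyncInvoke(enriched: String, resultFuture: ResultFuture[(String, Double)]): Unit = {
    Future {
      // Replace this stub with a real, ideally non-blocking, client call.
      val score = callModelService(enriched)
      resultFuture.complete(Iterable((enriched, score)))
    }
  }

  private def callModelService(payload: String): Double = 0.0 // placeholder
}

object ScoringJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val enriched: DataStream[String] = env.fromElements("a->b,10.0,3") // stand-in for the Kafka source

    // Up to 100 in-flight requests, 1 second timeout per request.
    val scored: DataStream[(String, Double)] =
      AsyncDataStream.unorderedWait(enriched, new ScoringAsyncFunction, 1, TimeUnit.SECONDS, 100)

    scored.print()
    env.execute("scoring-job")
  }
}
```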
And if you do want to use queryable state, you could use Flink's REST API to determine the job id.
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.
Is it possible to connect to an Oracle database and get a live data stream into Google Cloud Pub/Sub?
The short answer to your question is yes, but the longer, more detailed answer requires making some assumptions: when you say stream, do you literally mean streaming, or do you mean batch updates every minute?
I'm asking because the answer has huge implications: if you require a true streaming solution, the only one is to bolt an Oracle product called Oracle GoldenGate on top of your database. This product is costly both in dollars and in engineering effort.
If a near-real-time solution is suitable for you, then you can use any of the following, with either plain SQL or a streaming framework like Beam or Spark:
- NiFi
- Airflow
- Luigi
Or any other orchestration platform that can run queries on a timer. At the end of the day, all you need is something that can do select * from table where last_update > now() - threshold, generate an event for each delta, and then publish all the deltas to PubSub.
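As an illustration of that last pattern, here is a rough sketch in Scala that polls Oracle over JDBC and publishes each delta with the Google Cloud Pub/Sub Java client. The table, columns, topic, and connection details are placeholders, and real scheduling, retries, and durable storage of the high-water mark are left out.

```scala
import java.sql.{DriverManager, Timestamp}
import java.util.concurrent.TimeUnit

import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

object OraclePollToPubSub {
  def main(args: Array[String]): Unit = {
    // Placeholder project/topic and JDBC settings; needs the Oracle JDBC driver on the classpath.
    val publisher = Publisher.newBuilder(TopicName.of("my-project", "oracle-deltas")).build()
    val conn = DriverManager.getConnection("jdbc:oracle:thin:@//db-host:1521/ORCL", "user", "password")

    var lastSeen = new Timestamp(0L) // persist this somewhere durable in a real job

    val stmt = conn.prepareStatement("SELECT id, payload, last_update FROM my_table WHERE last_update > ?")
    stmt.setTimestamp(1, lastSeen)
    val rs = stmt.executeQuery()
    while (rs.next()) {
      // One message per changed row (use JSON/Avro in practice instead of this toy CSV).
      val delta = s"${rs.getLong("id")},${rs.getString("payload")},${rs.getTimestamp("last_update")}"
      publisher.publish(PubsubMessage.newBuilder().setData(ByteString.copyFromUtf8(delta)).build())
      lastSeen = rs.getTimestamp("last_update")
    }

    publisher.shutdown()
    publisher.awaitTermination(1, TimeUnit.MINUTES)
    conn.close()
  }
}
```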
Yes. There is a provided template at https://cloud.google.com/dataflow/docs/templates/provided-templates#gcstexttocloudpubsub that reads text from Google Cloud Storage and publishes it to Cloud Pub/Sub. You should be able to change the code that reads from storage to read from your database instead.
Yes. I tried this as part of a POC. Use triggers to capture the changed records from Oracle, use a cursor to convert them into a .txt file with the data in JSON format, and prepare a batch script that reads the data and includes a publish command to push it to Cloud Pub/Sub. That is the overall flow.
You can consider using Change Data Capture (CDC) tools like Debezium that detect your database changes in real time.
Docs: https://debezium.io/documentation/reference/operations/debezium-server.html
With Spring Boot: https://www.baeldung.com/debezium-intro
I am working on building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with, say, 50 partitions (incoming rate: 100,000 msgs/sec)
Read data from Kafka and process each record in real time (do some computation, compare with old data, etc.)
Store the output in Cassandra
I was looking for a real-time streaming platform and found Flink to be a great fit for both real-time and batch processing.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
Say each computation takes about 20 milliseconds; how can I design it better with Flink to get higher throughput?
Can I use Redis or Cassandra to look up data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data by key over a time window (for example, 5 seconds)? Let's say 100 messages come in and 10 of them have the same key; can I group all messages with the same key together and process them?
Are there any tutorials on best practices for using Flink?
Thanks and appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput and has a parameter to tune the trade-off between them. You can read and write data from and to Redis or Cassandra; however, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow this Flink training to learn the API.
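For the windowing part of your question (grouping messages by key over 5-second windows), a minimal sketch looks like this. The Event type and values are placeholders; note that with a bounded demo source a processing-time window may never fire before the job ends, so in practice you would plug in your Kafka source and a Cassandra sink.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(key: String, value: Long)

object KeyedWindowJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in for the Kafka source described in the question.
    val events: DataStream[Event] = env.fromElements(Event("a", 1), Event("a", 2), Event("b", 5))

    val aggregated = events
      .keyBy(_.key)                                              // messages with the same key go to the same task
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5))) // 5-second tumbling windows
      .reduce((a, b) => Event(a.key, a.value + b.value))         // or any other per-group computation

    aggregated.print()                                           // replace with a Cassandra sink in the real pipeline
    env.execute("keyed-window-job")
  }
}
```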
I am looking to migrate from a homegrown streaming server to Apache Flink. One thing that we have is an Apache Storm-like DRPC interface to run queries against the state held in the processing topology.
So, for example: I have a bunch of sensors that I am running a moving average on. I want to run a query against the topology and return all the sensors whose average is above a fixed value.
Is there an equivalent in Flink, or if not, what is the best way to achieve equivalent functionality?
Out of the box, Flink does not come with a solution for querying the internal state of operators right now. You're in luck, however, because there are two options: We built an example of a stateful word count that allows querying the state. It is available here: https://github.com/dataArtisans/query-window-example
For one of the upcoming versions of Flink we are also working on a generic solution for the queryable state use case. This will allow querying the state of any internal operator.
Also, could it suffice, in your case, to periodically output the values to something like Elasticsearch using a window operation? The results could then simply be queried from Elasticsearch.
They are shipping an out-of-the-box solution called Queryable State in the next release.
Here is an example:
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/query/QueryableStateITCase.java
But I suggest you read more about it first and then look at the example.
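For orientation, this is roughly what the job side looks like with the Queryable State API in later releases: keep a per-sensor aggregate in keyed state and expose it under a name that an external QueryableStateClient can look up using the job id and a key. Note that queries are per-key point lookups, so answering "all sensors above a threshold" still requires knowing which keys to ask for. All names below are placeholders.

```scala
import org.apache.flink.streaming.api.scala._

object QueryableSensorAverages {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // (sensorId, reading) pairs; replace with your real source.
    val readings: DataStream[(String, Double)] =
      env.fromElements(("sensor-1", 42.0), ("sensor-1", 40.0), ("sensor-2", 17.5))

    // Keep a running average per sensor in keyed state ...
    val averages = readings
      .keyBy(_._1)
      .mapWithState[(String, Double), (Double, Long)] { case ((id, value), state) =>
        val (sum, count) = state.getOrElse((0.0, 0L))
        val newSum = sum + value
        val newCount = count + 1
        ((id, newSum / newCount), Some((newSum, newCount)))
      }

    // ... and expose the latest average per sensor for external point lookups under this name.
    averages
      .keyBy(_._1)
      .asQueryableState("sensor-average")

    env.execute("queryable-sensor-averages")
  }
}
```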
After processing those millions of events/records, where is the best place to store the information, assuming it is worth keeping millions of events? I saw a pull request closed by this commit mentioning the Parquet format, but is HDFS the default? My concern is whether, after saving the data (where?), it is easy (and fast!) to retrieve it.
Apache Flink is not coupled with specific storage engines or formats. The best place to store the results computed by Flink depends on your use case.
- Are you running a batch or streaming job?
- What do you want to do with the result?
- Do you need batch (full scan), point, or continuous streaming access to the data?
- What format does the data have? Flat structured (relational), nested, blob, ...
Depending on the answers to these questions, you can choose from various storage backends, such as:
- Apache HDFS for batch access (with different storage formats such as Parquet, ORC, or custom binary)
- Apache Kafka if you want to access the data as a stream
- a key-value store such as Apache HBase or Apache Cassandra for point access to data
- a database such as MongoDB, MySQL, ...
Flink provides OutputFormats for most of these systems (some through a wrapper for Hadoop OutputFormats). The "best" system depends on your use case.
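For example, writing a batch result out as CSV on HDFS is a one-liner; other systems are wired up the same way through their respective OutputFormats or connector sinks. The path and data below are placeholders.

```scala
import org.apache.flink.api.scala._

object WriteResults {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Stand-in for a real computation.
    val counts: DataSet[(String, Int)] = env.fromElements(("flink", 3), ("kafka", 5))

    // Plain CSV on HDFS; swap in a Parquet/ORC OutputFormat or a connector sink as needed.
    counts.writeAsCsv("hdfs:///results/word-counts")
    env.execute("write-results")
  }
}
```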