Checking for duplicates in Apache Flink when a job fails

I'm using Kafka and Flink. Flink reads each message from Kafka, runs some business logic against a database, and then sends the result to a third-party API (e.g. mail, Google Sheets). Each message must be sent exactly once. Everything works well, but when the job fails and restarts (I'm using checkpoints), some messages are replayed and sent to the third-party API again. I could use Redis to record which messages have already been sent, but then every message has to be checked against Redis, which hurts performance. I'm looking for a solution that doesn't need Redis to detect duplicates.
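
One way to picture the replay problem: a message that is re-read after a restart still has the same Kafka coordinates, so a key derived from them is identical on both deliveries. The sketch below only shows such a key derivation. It assumes the key is built from topic, partition, and offset; the class and method names are hypothetical, and whether the receiving side (the third-party API, or a unique constraint in the database the job already writes to) can reject a repeated key is an assumption that would need checking.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper: derives a stable key from a record's Kafka coordinates.
final class IdempotencyKeys {
    private IdempotencyKeys() {}

    // The same (topic, partition, offset) always yields the same key, so a
    // message replayed after a restart maps to the key it had the first time.
    static String keyFor(String topic, int partition, long offset) {
        String coordinates = topic + "/" + partition + "/" + offset;
        return UUID.nameUUIDFromBytes(coordinates.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

The key could travel with the outgoing request so that the duplicate check happens where the side effect is applied, rather than in a separate lookup per message.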

Related

Apache Flink - how to stop and resume stream processing on downstream failure

I have a Flink application that consumes incoming messages on a Kafka topic with multiple partitions, does some processing, then sends them to a sink that forwards them over HTTP to an external service. Sometimes the downstream service is down, and stream processing needs to stop until it is back in action.
There are two approaches I am considering.
1. Throw an exception when the Http sink fails to send the output message. This will cause the task and job to restart according to the configured restart strategy. Eventually the downstream service will be back and the system will continue where it left off.
2. Have the Sink sleep and retry on failure; it can do this continually until the downstream service is back.
From what I understand and from my PoC, with 1. I will lose exactly-once guarantees, since the sink itself is external state. As far as I can see, you cannot make a simple HTTP endpoint transactional, which it would need to be to implement TwoPhaseCommitSinkFunction.
With 2. this is less of an issue, since the pipeline will not proceed until the sink makes a successful write, and I can rely on back pressure throughout the system to pause the retrieval of messages from the Kafka source.
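
As a rough illustration of approach 2, the sketch below blocks the calling thread and retries with a capped exponential backoff until the send succeeds. It is plain Java, not a Flink sink: `RetryingSender` and the `trySend` predicate (standing in for the HTTP call) are hypothetical names, and the backoff values are arbitrary.

```java
import java.util.function.Predicate;

// Hypothetical blocking sender illustrating approach 2 (sleep and retry).
final class RetryingSender<T> {
    private final Predicate<T> trySend; // assumed: returns true when the HTTP call succeeded
    private final long initialDelayMs;
    private final long maxDelayMs;

    RetryingSender(Predicate<T> trySend, long initialDelayMs, long maxDelayMs) {
        this.trySend = trySend;
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Blocks the calling thread until the message is accepted and
    // returns the number of attempts that were needed.
    int send(T message) {
        long delayMs = initialDelayMs;
        int attempts = 0;
        while (true) {
            attempts++;
            if (trySend.test(message)) {
                return attempts;
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", e);
            }
            delayMs = Math.min(delayMs * 2, maxDelayMs); // exponential backoff with a cap
        }
    }
}
```

Called from inside a sink, a loop like this keeps the record in hand until it is delivered, which is what lets back pressure build up behind it.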
The main questions I have are:
Is it a correct assumption that you can't make a TwoPhaseCommitSinkFunction for a simple HTTP endpoint?
Which of the two strategies, or neither, makes the most sense?
Am I missing simpler obvious solutions?

I think you can try Async I/O in Flink: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/
Make the HTTP endpoint send its response only once all processing for the request is done, e.g. the HTTP server has finished handling the request and has committed the result to the DB. Then use an asynchronous HTTP client in the Async I/O operator. The operator waits until the response is received. If an error occurs, the Flink streaming pipeline fails and restarts according to the configured recovery strategy.
Requests sent to the HTTP endpoint that have not yet received a response are held in the Async I/O operator's internal buffer. These in-flight requests are stored in checkpoints, so they are re-issued when the pipeline recovers from a failure. The operator also triggers back pressure when its internal buffer is full.
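
To make the "buffer full means back pressure" idea concrete, here is a minimal plain-Java sketch of a bounded set of in-flight asynchronous requests. It is not Flink's implementation and does not use the Flink API: `BoundedAsyncClient`, its `capacity`, and the `call` function (standing in for an async HTTP client) are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical stand-in for a bounded buffer of in-flight async requests.
final class BoundedAsyncClient<I, O> {
    private final Semaphore permits;
    private final Function<I, CompletableFuture<O>> call; // assumed async HTTP call

    BoundedAsyncClient(int capacity, Function<I, CompletableFuture<O>> call) {
        this.permits = new Semaphore(capacity);
        this.call = call;
    }

    // Blocks the caller while `capacity` requests are already in flight;
    // that blocking is what propagates back pressure upstream.
    CompletableFuture<O> submit(I input) {
        permits.acquireUninterruptibly();
        CompletableFuture<O> response;
        try {
            response = call.apply(input);
        } catch (RuntimeException e) {
            permits.release(); // the request never started, so free its slot
            throw e;
        }
        // Free the slot when the response (or an error) arrives.
        return response.whenComplete((value, error) -> permits.release());
    }

    int freeSlots() {
        return permits.availablePermits();
    }
}
```

In Flink itself the capacity and timeout are parameters of the Async I/O operator, and the operator, not user code, takes care of storing in-flight records in checkpoints.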

Is this system an optimal solution to sync an app with a server in real time efficiently?

Problem
I have an Android and iOS app that looks like a classic social network. I need to update the UI in real time. Currently, each client polls a PHP script over HTTP every second. The PHP script queries the database every second for every client and responds, most of the time, that there is no new update. If there is a new update, the PHP script processes it and sends it back to the client app.
There are 3 problems with this approach: (1) a slow user experience (a 1-second delay each time) plus high battery and data usage, (2) the Apache machines are hit every second by incoming HTTP requests, (3) the database machine is hit every second by the Apache machines (asking whether there are newly stored updates in the main database).
I feel that this system could be substantially improved. For problem (1), I know a TCP connection can be "piped" to the app, but problem (3) remains, because the thread behind the socket still polls the database every second to know whether there are newly stored updates for its member ID.
Solution?
I thought of a system that avoids any activity (on the client, the Apache servers, and the database) when there are no new updates. There would be N Apache servers on N machines, behind a load balancer exposed to the Internet. Behind these Apache servers, connected only to the local network, there would be 1 "central" database and 1 "update" database dedicated to the update system. The "update" database would store 2 tables:
1 table mapping user tokens (and their member ID) to the thread ID and the name of the Apache machine currently holding the thread. One user ID may have several connection tokens, but each connection token is associated with exactly one pair (PID, machine name). Each time a user connects to the app, a TCP connection is created and held by one thread (on one Apache machine), and the [thread ID - machine name] pair is stored in that table.
1 table to store the updates themselves. They contain all the information needed to get up-to-date data (either in raw primitive form, like a string or an int, or in "reference" form, telling the recipient TCP threads that they need to compute some params "at sending time", for more complex data structures).
The system would work as follows:
(1) A user wants to send a message to another user. The sender's app client sends an HTTP request to the app API endpoint; the load balancer forwards the request to one of the Apache machines.
(2) The Apache server asks the main database to insert the "user message" row.
(3) The Apache server asks the "update" database whether the recipient has any currently connected device.
(4) If there is at least one connected device, it inserts an "update" row in the "update" database with all the information needed, and wakes up all threads associated with the recipient user ID (maybe using C signals?).
(5) All the threads associated with the recipient user ID wake up, look in the "update" database for new updates associated with their user ID, process their parameters (especially if there are reference params to be computed), and send them to the recipient devices via TCP.
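
The first table and step (3) can be sketched as an in-memory lookup. This is only an illustration of the data shape described above, not a recommendation for where to store it: `ConnectionRegistry`, `Connection`, and the field names are hypothetical, and a real deployment spanning several machines would need a shared store rather than one process's memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory version of the token -> (machine, thread) table.
final class ConnectionRegistry {
    // Each connection token maps to exactly one (machine, thread) pair.
    static final class Connection {
        final long userId;
        final String machine;
        final long threadId;

        Connection(long userId, String machine, long threadId) {
            this.userId = userId;
            this.machine = machine;
            this.threadId = threadId;
        }
    }

    private final Map<String, Connection> byToken = new ConcurrentHashMap<>();

    // Called when a device opens its TCP connection.
    void register(String token, long userId, String machine, long threadId) {
        byToken.put(token, new Connection(userId, machine, threadId));
    }

    // Called when the TCP connection closes.
    void unregister(String token) {
        byToken.remove(token);
    }

    // Step (3): which connections belong to this recipient right now?
    List<Connection> connectionsOf(long userId) {
        List<Connection> result = new ArrayList<>();
        for (Connection c : byToken.values()) {
            if (c.userId == userId) {
                result.add(c);
            }
        }
        return result;
    }
}
```

An empty result from `connectionsOf` corresponds to the case in step (4) where no "update" row needs to be written at all.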
So my final question is: is such a system feasible and reliable, and if so, do you think it can be optimal in terms of database and Apache machine performance?
I'm more of a front-end programmer and I'm not used to implementing complex server architectures, so I wanted some opinions before diving into the code, especially if I missed something in my approach (is storing PIDs reliable? Is it possible for one machine to wake up a thread on another machine through the local network? ...)
PS: I already tried Firebase Cloud Messaging, but the problem is that it only allows a one-dimensional array to be sent with the update params. When dealing with a complex data structure (like a "user message"), when I receive a signal from FCM in my client app, I still need to make an extra HTTP call to my server to retrieve the new "user message" JSON payload. So, good for my Apache and database machines (they are not bothered when there are no new updates), bad for the client app, which has to send additional HTTP requests. Once again, tell me if I missed something here :)
Thanks for reading

Apache Camel: complete exchanges when an aggregated exchange is completed

In my Apache Camel application, I have a very simple route:
from("aws-sqs://...")
    .aggregate(constant(true), new AggregationStrategy())
    .completionSize(100)
    .to("SEND_AGGREGATE_VIA_HTTP");
That is, it takes messages from AWS SQS, groups them in batches of 100, and sends them via HTTP somewhere.
Exchanges with messages from SQS are completed successfully on getting into the aggregate stage, and SqsConsumer deletes them from the queue at this point.
The problem is that something might happen with an aggregated exchange (it might be delivered with an error), and messages will be lost. I would really like these original exchanges to be completed successfully (messages to be deleted from a queue) only when an aggregated exchange they're in is also completed successfully (a batch of messages is delivered). Is there a way to do this?
Thank you.

You could set deleteAfterRead to false and manually delete the messages after you've sent them to your HTTP endpoint. You could use a bean or a processor and send the proper SQS delete requests through the AWS SDK library. It's a workaround, granted, but I don't see a better way of doing it.
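
The ordering this workaround relies on (deliver the batch first, delete from the queue second) can be sketched in plain Java. The sketch does not use Camel or the AWS SDK: `AckAfterBatch`, `Message`, `sendBatch`, and `deleteFromQueue` are hypothetical stand-ins for the aggregation step, the HTTP delivery, and the SQS delete request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical batcher: messages are deleted only after their batch was delivered.
final class AckAfterBatch {
    static final class Message {
        final String receiptHandle; // what a delete request would need
        final String body;

        Message(String receiptHandle, String body) {
            this.receiptHandle = receiptHandle;
            this.body = body;
        }
    }

    private final int batchSize;
    private final Predicate<List<String>> sendBatch; // assumed: true when HTTP delivery succeeded
    private final Consumer<String> deleteFromQueue;  // assumed: wraps the queue's delete call
    private final List<Message> pending = new ArrayList<>();

    AckAfterBatch(int batchSize, Predicate<List<String>> sendBatch, Consumer<String> deleteFromQueue) {
        this.batchSize = batchSize;
        this.sendBatch = sendBatch;
        this.deleteFromQueue = deleteFromQueue;
    }

    void onMessage(Message message) {
        pending.add(message);
        if (pending.size() < batchSize) {
            return; // batch not complete yet, nothing is deleted
        }
        List<String> bodies = new ArrayList<>();
        for (Message m : pending) {
            bodies.add(m.body);
        }
        if (sendBatch.test(bodies)) {
            for (Message m : pending) {
                deleteFromQueue.accept(m.receiptHandle); // delete only after success
            }
        }
        // On failure nothing is deleted, so the queue can redeliver the messages later.
        pending.clear();
    }
}
```

With SQS, undeleted messages become visible again after the visibility timeout, so a failed batch is retried by redelivery rather than by this class.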

Google Channel API sends a message to all clients

I created a working Google Channel API setup and now I would like to send a message to all clients.
I have two servlets. The first creates the channel and tells the clients the user ID and token. The second one is called by an HTTP POST and should send the message.
To send a message to a client, I use:
channelService.sendMessage(new ChannelMessage(channelUserId, "This is a server message!"));
This sends the message to just one client. How can I send it to all of them?
Do I have to store every ID that I use to create a channel and send the message once per ID? How can I pass the IDs to the second servlet?

With the Channel API, it is not possible to create one channel and then have many subscribers to it. The server creates a unique channel for each individual JavaScript client, so if several clients share the same client ID, the messages will be received by only one of them.
If you want to send the same message to multiple clients, in short, you will have to keep track of the active clients and send the same message to each of them.
If that approach sounds scary and messy, consider using PubNub for your push notification messages, where you can easily create one channel and have many subscribers. Making it run on Google App Engine is not that hard, since they support almost any platform or device.
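
The "keep track of the active clients" approach could look roughly like the sketch below. It is independent of the Channel API: `Broadcaster` is a hypothetical class, and `sendToClient` stands in for a call such as the `channelService.sendMessage(new ChannelMessage(channelUserId, ...))` line from the question. It also assumes both servlets can reach the same set of IDs, which on App Engine would in practice mean a shared store rather than a field in memory.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Hypothetical tracker of active client IDs with a simple fan-out send.
final class Broadcaster {
    private final Set<String> activeClientIds = ConcurrentHashMap.newKeySet();
    private final BiConsumer<String, String> sendToClient; // (clientId, message), assumed

    Broadcaster(BiConsumer<String, String> sendToClient) {
        this.sendToClient = sendToClient;
    }

    // Called by the servlet that creates a channel for a client.
    void connected(String clientId) {
        activeClientIds.add(clientId);
    }

    void disconnected(String clientId) {
        activeClientIds.remove(clientId);
    }

    // Sends the same message once per active client and returns how many were sent.
    int broadcast(String message) {
        int sent = 0;
        for (String clientId : activeClientIds) {
            sendToClient.accept(clientId, message);
            sent++;
        }
        return sent;
    }
}
```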

I know this is an old question, but I just finished an open source project that uses the Channel API to implement a publish/subscribe model, i.e. you can have multiple users subscribe to a single topic, and then all those subscribers will be notified when anyone publishes a message to the topic. It also has some nice features like automatic message persistence if desired, and "return receipts", where a subscriber can be notified whenever OTHER subscribers receive that message. See https://github.com/adevine/gaewebpubsub#gae-web-pubsub. Licensed under Apache 2.0 license.

Structure to handle inter-device messaging

What is the best way to handle messages sent through a server to multiple devices?
Scenario
It will be an app capable of running on multiple mobile platforms, as well as online in a web browser: a type of instant messenger. The messages will be directed through a server to another mobile device.
The back-end structure/concept must be basically the same as WhatsApp's: sending messages to one another like that.
What I think
Have the device send it to the web-server.
Server saves it in a queue table in a database.
When receiver device checks for new message (every second) it finds it in the queue.
Remove it from queue and put message in history table.
Final
What would be an efficient way to structure/handle such an app to get results similar to WhatsApp?

You may want to push messages instead of pulling them every second. This has two big advantages:
Less bandwidth usage.
You can skip the database part if the sender and the receiver are both connected when the message is sent. Only queue the messages in the database if the receiver isn't connected.
So it's a huge performance boost if you use push.
If you have a web app using JavaScript, you can use a JSON stream or, for newer browsers, JavaScript WebSockets.
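
A minimal sketch of that routing decision, assuming an in-memory map of connected users and an in-memory queue standing in for the database table. `MessageRouter` and its method names are hypothetical, and the `Consumer<String>` represents whatever pushes data down an open connection.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical router: push directly when the receiver is connected, queue otherwise.
final class MessageRouter {
    private final Map<String, Consumer<String>> online = new HashMap<>();
    private final Map<String, Queue<String>> queued = new HashMap<>(); // stand-in for the DB queue

    // Returns true if the message was pushed directly, false if it was queued.
    synchronized boolean send(String recipient, String message) {
        Consumer<String> push = online.get(recipient);
        if (push != null) {
            push.accept(message);
            return true;
        }
        queued.computeIfAbsent(recipient, k -> new ArrayDeque<>()).add(message);
        return false;
    }

    // On connect, deliver anything that was queued while the user was away.
    synchronized void connect(String user, Consumer<String> push) {
        online.put(user, push);
        Queue<String> backlog = queued.remove(user);
        if (backlog != null) {
            for (String message : backlog) {
                push.accept(message);
            }
        }
    }

    synchronized void disconnect(String user) {
        online.remove(user);
    }

    synchronized int queuedFor(String user) {
        Queue<String> backlog = queued.get(user);
        return backlog == null ? 0 : backlog.size();
    }
}
```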
