Creating a custom atomic acking aggregator - apache-camel

I've asked similar questions and had great responses... but my request here seems sufficiently different to ask separately.
The Camel Aggregator, as awesome as it is, is not going to cut it for me. I need to aggregate exchange data and, when I hit a certain size, forward it on to a queue. Only when that happens can I ACK the original source messages off the queue. The aggregator's persistence choices aren't really an option for environmental reasons: there's no RDBMS around, and the other options amount to locally managed state. If the route went down, or the box, then I need to be able to carry on processing, and if I had messages in that local DB then it becomes a recovery job. Thanks, ZooKeeper and Camel's integration with it!
I'm basically thinking I need to implement a processor or a bean (what are the subtle differences?) that will take exchanges and put them in a map. When I hit a size, forward the joined exchange on to an endpoint, and then somehow ack all the original messages.
What I want to know is: what API do I use to take control of the exchange so I can effectively stop it without acking, and pull out what I need so I can ack later?
Can anyone provide some guidance and point me at the relevant functions on the objects of interest?
I have a nice simple idea for this. I was going to extend the Rabbit* classes, specifically RabbitConsumer.doHandleDelivery, and have that do my noddy aggregation. Once the aggregation is complete it would call Exchange exchange = consumer.getEndpoint().createRabbitExchange(envelope, properties, body); and, depending on the result of consumer.getProcessor().process(exchange);, it would ack or reject all the accumulated messages. On the face of it I would say it would all work quite well, although I would need some synchronisation in the RabbitConsumer.
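To make that concrete, here is a minimal sketch of the batch-then-ack idea, written against the plain RabbitMQ Java client rather than Camel's RabbitConsumer itself; the class name, batch size handling, and the forwardAggregate() hook are all illustrative, and the synchronisation is the coarse kind mentioned above:

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative batching consumer: buffer deliveries, forward the aggregate,
// then ack (or nack) everything up to the last delivery tag in one call.
public class BatchingConsumer extends DefaultConsumer {
    private final int batchSize;
    private final List<byte[]> batch = new ArrayList<>();
    private long lastDeliveryTag;

    public BatchingConsumer(Channel channel, int batchSize) {
        super(channel);
        this.batchSize = batchSize;
    }

    @Override
    public synchronized void handleDelivery(String consumerTag, Envelope envelope,
            AMQP.BasicProperties properties, byte[] body) throws IOException {
        batch.add(body);
        lastDeliveryTag = envelope.getDeliveryTag();
        if (batch.size() < batchSize) {
            return; // keep buffering; nothing is acked yet
        }
        try {
            forwardAggregate(batch); // build the joined exchange and process it
            // multiple=true acknowledges every un-acked delivery up to this tag
            getChannel().basicAck(lastDeliveryTag, true);
        } catch (Exception e) {
            // multiple=true, requeue=true rejects the whole batch for redelivery
            getChannel().basicNack(lastDeliveryTag, true, true);
        } finally {
            batch.clear();
        }
    }

    private void forwardAggregate(List<byte[]> bodies) {
        // placeholder: hand the aggregated payload to the downstream endpoint
    }
}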

Just to give peeps an update, I built my own batching RMQ consumer.
Pretty simple really, but I just had to make sure I built on the onXXX functions so the route could be paused/resumed and stopped/started.
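For anyone following along, the onXXX functions in question are the lifecycle hooks Camel's ServiceSupport gives a consumer. A rough outline, assuming the Camel 2.x package layout, with the actual connection handling elided and the class name invented:

import org.apache.camel.Endpoint;
import org.apache.camel.Processor;
import org.apache.camel.impl.DefaultConsumer;

// Outline only: wiring a batching consumer into Camel's lifecycle so
// pausing/resuming and stopping/starting the route behaves properly.
public class BatchingRabbitConsumer extends DefaultConsumer {

    public BatchingRabbitConsumer(Endpoint endpoint, Processor processor) {
        super(endpoint, processor);
    }

    @Override
    protected void doStart() throws Exception {
        super.doStart();
        // open the RabbitMQ connection/channel and begin consuming
    }

    @Override
    protected void doStop() throws Exception {
        // nack any partially filled batch so it is redelivered, then close
        super.doStop();
    }

    @Override
    protected void doSuspend() throws Exception {
        // stop taking new deliveries but keep the connection open
    }

    @Override
    protected void doResume() throws Exception {
        // begin taking deliveries again
    }
}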


Synchronous response with Apache Flink

I could not find any answer to my question on the web so far, so I thought it best to ask here.
I know that Apache Flink is asynchronous by design, but I was wondering if there is any project or design that aims to build a synchronous pipeline with Flink.
By a synchronous response I mean, for example, having an API endpoint that I send my data to, where the processing is done by Flink and the outcome of the processing is given back (in whatever form) in the body of the answer to the API call, e.g. a 200.
I already looked into RabbitMQ RPC but I was not able to successfully implement it.
I'm happy for any direction or suggestion.
Thanks,
Jon
The closest thing that comes to mind is deploying a Flink job with the TcpSource available in Apache Bahir. You could have an HTTP endpoint that receives some data, calls Flink on the specified address, and then processes it and creates a response. The problem is that only the TcpSource is available in Bahir, which means you would need to create a large part of the code (the whole sink) yourself.
There can also be other ways of doing it (like trying to assign an id to each message and then waiting for the message with that id to arrive on Kafka and sending it as the response), but that seems troublesome and error-prone.
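For completeness, a rough sketch of that correlate-by-id pattern on the HTTP side, with the Kafka plumbing stubbed out; every name here is invented, and the timeout handling is deliberately crude:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Illustrative correlation registry: the HTTP handler parks a future keyed
// by a request id; a background Kafka consumer loop completes it when the
// Flink job's result for that id shows up on the output topic.
public class ResponseCorrelator {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Called by the HTTP handler: send the payload into Flink's input topic
    // tagged with the id, then block (bounded) waiting for the result.
    public String requestResponse(String payload) throws Exception {
        String id = UUID.randomUUID().toString();
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(id, future);
        publishToInputTopic(id, payload); // e.g. a KafkaProducer.send(...)
        try {
            return future.get(5, TimeUnit.SECONDS);
        } finally {
            pending.remove(id);
        }
    }

    // Called from the Kafka consumer loop for each record on the output topic.
    public void onResult(String id, String result) {
        CompletableFuture<String> future = pending.remove(id);
        if (future != null) {
            future.complete(result);
        }
    }

    private void publishToInputTopic(String id, String payload) {
        // placeholder for the actual Kafka producer call
    }
}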
The other way would be to make the response asynchronous (I know the question specifically mentions a sync response, but I mention this for the sake of completeness).
However, I would like to say that this seems like a misuse of Flink to me. Flink was primarily designed to allow real-time computations across multiple nodes, which doesn't seem to be the case here. I would suggest looking into different streaming libraries that are much more lightweight and easier to compose, and that can offer the functionality you want out of the box. You may want to take a look at Akka Streams, for example.

Message queue like RabbitMQ for high volume writes to SQL database?

The scenario is needing to write high volume data, like tracking clicks or mouse movements, from a web application to a SQL database. The data doesn't need to be written right away because the analysis on the data happens on some recurring basis, like daily or weekly.
I want some feedback on a solution that comes to mind:
The click and mouse data is published to a message queue. The queue stores its items in memory, so it should be fast, and faster than writing straight to SQL. Then, on some other server, a job plugs away at retrieving the next queue item and writing the data to SQL.
Does anyone know of implementations like this? What pitfalls am I failing to see? If this solution is not a good one are there other alternatives?
Regards
RabbitMQ is meant for real-time message exchange, not for temporarily buffering data. If you are able to consume all data as soon as it arrives in your queues, then this solution will work for you. Otherwise RabbitMQ will grow in memory and eventually die, at which point you will have to configure it to throw some data away (there are a lot of options for choosing the rules for this).
You could alternatively store the data in a Redis cache; you can write to it as fast as you can publish events to RabbitMQ. Then you can listen for new changes in Redis from the remote server and fill up whatever database storage you use, or even use Redis as your data storage.
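A minimal sketch of that Redis-as-buffer idea using the Jedis client, with the producer pushing events onto a list and a separate process draining them into SQL; the key name and the writeToSql() hook are just illustrative:

import redis.clients.jedis.Jedis;
import java.util.List;

public class RedisBuffer {
    private static final String QUEUE_KEY = "click-events"; // illustrative key name

    // Producer side (web app): push each event onto a Redis list.
    public static void publish(Jedis jedis, String eventJson) {
        jedis.lpush(QUEUE_KEY, eventJson);
    }

    // Consumer side (remote server): block until an event arrives, then
    // hand it to whatever writes it into the SQL database.
    public static void drain(Jedis jedis) {
        while (true) {
            // BRPOP blocks for up to 5 seconds; returns [key, value] or null
            List<String> item = jedis.brpop(5, QUEUE_KEY);
            if (item != null) {
                writeToSql(item.get(1));
            }
        }
    }

    private static void writeToSql(String eventJson) {
        // placeholder: INSERT into the analytics table
    }
}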
To solve a very similar problem I was considering doing exactly this. In the end we decided not to go for it because we did need access to the data very quickly. However, I still like the idea.
I've also recently learnt that, under the hood, this is exactly the way that Microsoft Dynamics CRM does its database updates, using message passing.
Things I think you would need to pay careful attention to:
Make sure that if your RabbitMQ instance disappeared it wouldn't have any effect on your client. Rabbit dying is bad enough; your client erroring because Rabbit is down would be terrible.
If it's truly very high volume (and it's good practice for reliability anyway), clustering is something worth looking at.
Obviously paying attention to your dead-letter queues is a must. But the ability to play back messages which failed for some reason is awesome: in theory at least, your data should always eventually get to your database, even if it went down for a period of time.
Make sure you can keep up with the number of messages being passed in. Of course, this should be solvable by adding more consumers to a given queue. Which leads to...
Idempotency of messages. Given that your messages relate directly to a DB write, they HAVE to be idempotent.
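A minimal illustration of what an idempotent write can look like, assuming each message carries a unique id and a PostgreSQL-style ON CONFLICT clause; the table and column names are made up:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentWriter {
    // Replaying the same message is harmless: the unique event_id makes the
    // second INSERT a no-op instead of a duplicate row.
    public static void write(Connection conn, String eventId, String payload)
            throws SQLException {
        String sql = "INSERT INTO click_events (event_id, payload) VALUES (?, ?) "
                   + "ON CONFLICT (event_id) DO NOTHING";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, eventId);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}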

Parallel calls to google.appengine.api.channel.send_message

I am using send_message(client_id, message) in google.appengine.api.channel to fan out messages. The most common use case is two users. A typical trace looks like the following:
The two calls to send_message are independent. Can I perform them in parallel to save latency?
Well, there's no async API available, so you might need to implement a custom solution.
Have you already tried native threading? It could work in theory, but because of the GIL the XMPP API would have to genuinely block on I/O (releasing the GIL while it waits) for threads to help, and I'm not sure it does.
A custom implementation will invariably come with some overhead, so it might not be the best idea for your simple case, unless it breaks the experience for the >2 user cases.
There is, however, another problem that might make it worth your while: what happens if the instance crashes and only got to send the first message? The API isn't transactional, so you should have some kind of safeguard. Maybe a simple recovery mode will suffice, given how infrequently this will happen, but I'm willing to bet a transactional message channel sounds more appealing, right?
Two ways you could go about it, off the top of my head:
Push a task for every message; tasks are transactional and guaranteed to run, and they'll execute in parallel with fairly identical run times (see the sketch after this list). It'll increase the time it takes for the first message to go out, but will keep it consistent between all of them.
Use a service built for this exact use case, like Firebase (though it might even be too powerful, lol). In my experience the Channel API is not very consistent and the performance is underwhelming for gaming, so this might make your system even better.
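The question is about the Python API, but just to show the shape of the task-per-message idea, here is roughly what it looks like against App Engine's Java task queue API; the worker URL and parameter names are invented, and the worker handler itself would make the single send_message call for its recipient:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;
import java.util.List;

public class FanOut {
    // Enqueue one task per recipient; App Engine runs the tasks in parallel
    // and retries any that fail, which is where the delivery guarantee comes from.
    public static void fanOut(List<String> clientIds, String message) {
        Queue queue = QueueFactory.getDefaultQueue();
        for (String clientId : clientIds) {
            queue.add(TaskOptions.Builder
                    .withUrl("/tasks/send_message")   // invented worker URL
                    .param("client_id", clientId)
                    .param("message", message));
        }
    }
}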
Fixed that for you
I just posted a patch on googleappengine issue 9157, adding:
channel.send_message_async for asynchronously sending a message to a recipient.
channel.send_message_multi_async for asynchronously broadcasting a single message to multiple recipients.
Some helper methods to make those possible.
Until the patch is pushed into the SDK, you'll need to include the channel_async.py file (that's attached on that thread).
Usage
import channel_async as channel
# this is synchronous D:
channel.send_message(<client-id>, <message>)
# this is asynchronous :D
channel.send_message_async(<client-id>, <message>)
# this is good for broadcasting a single message to multiple recipients
channel.send_message_multi_async([<client-id>, <client-id>], <message>)
# or
channel.send_message_multi_async(<list-of-client-ids>, <message>)
Benefits
Speed comparison on production:
Synchronous model: 2 - 80 ms per recipient (and blocking -.-)
Asynchronous model: 0.15 - 0.25 ms per recipient

Working with new channel creation limits

Google App Engine seems to have recently made a huge decrease in the free quota for channel creation, from 8640 to 100 per day. I would appreciate some suggestions for optimizing channel creation for a hobby project where I am unwilling to use the paid plans.
It is specifically mentioned in the docs that there can be only one client per channel ID. It would help if there were a way around this, even if it were only for multiple clients on one computer (such as multiple tabs)
It occurred to me I might be able to simulate channel functionality by repeatedly sending XHR requests to the server to check for new messages, therefore bypassing limits. However, I fear this method might be too slow. Are there any existing libraries that work on this principle?
One Client per Channel
There's not an easy way around the one client per channel ID limitation, unfortunately. We actually allow two, but this is to handle the case where a user refreshes his page, not for actual fan-out.
That said, you could certainly implement your own workaround for this. One trick I've seen is to use cookies to communicate between browser tabs. Then you can elect one tab the "owner" of the channel and fan out data via cookies. See this question for info on how to implement the inter-tab communication: Javascript communication between browser tabs/windows
Polling vs. Channel
You could poll instead of using the Channel API if you're willing to accept some performance trade-offs. Channel API delivery speed is on the order of 100-200 ms; if you could accept a 500 ms average then you could poll every second. Depending on the type of data you're sending, and how much you can fit in memcache, this might be a workable solution. My guess is your biggest problem is going to be instance-hours.
For example, if you have, say, 100 clients, you'll be looking at 100 qps. You should experiment and see if you can serve 100 requests in a second, for the data you need to serve, without spinning up a second instance. If not, keep increasing your latency (i.e., decreasing your polling frequency) until a single instance can serve your requests.
Hope that helps.

What's the best way for the client app to immediately react to an update in the database?

What is the best way to program an immediate reaction to an update to data in a database?
The simplest method I could think of offhand is a thread that checks the database for a particular change to some data, waits some predefined length of time, and then checks again. This seems wasteful and suboptimal to me, so I was wondering if there is a better way.
I figure there must be some way; after all, a web application like Gmail seems to be able to update my inbox almost immediately after a new email is sent to me. Surely my client isn't continually checking for updates all the time. I think the way they do this is with AJAX, but how AJAX can behave like a remote function call I don't know. I'd be curious to know how Gmail does this, but what I'd most like to know is how to do this in the general case with a database.
Edit:
Please note I want to react to the update immediately in the client code, not in the database itself, so as far as I know triggers alone can't do this. Basically I want the USER to get a notification or have their screen updated once the change in the database has been made.
You basically have two issues here:
You want a browser to be able to receive asynchronous events from the web application server without polling in a tight loop.
You want the web application to be able to receive asynchronous events from the database without polling in a tight loop.
For Problem #1
See these wikipedia links for the type of techniques I think you are looking for:
Comet
Reverse AJAX
HTTP Server Push
EDIT: 19 Mar 2009 - Just came across ReverseHTTP which might be of interest for Problem #1.
For Problem #2
The solution is going to be specific to which database you are using and probably the database driver your server uses too. For instance, with PostgreSQL you would use LISTEN and NOTIFY. (And at the risk of being down-voted, you'd probably use database triggers to call the NOTIFY command upon changes to the table's data.)
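As a rough illustration of the PostgreSQL route: a trigger calls NOTIFY on each change, and a listener on the application side drains the notifications. The sketch below assumes a recent PostgreSQL JDBC driver (whose PGConnection exposes getNotifications), plus invented channel/table names and placeholder credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.PGNotification;

// Illustrative listener. Assumes a trigger on the watched table runs
// something like:  PERFORM pg_notify('table_change', NEW.id::text);
public class PgChangeListener {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass"); // placeholders

        try (Statement st = conn.createStatement()) {
            st.execute("LISTEN table_change"); // subscribe to the channel
        }

        PGConnection pg = conn.unwrap(PGConnection.class);
        while (true) {
            // blocks for up to 500 ms waiting for a NOTIFY to arrive
            PGNotification[] notes = pg.getNotifications(500);
            if (notes != null) {
                for (PGNotification n : notes) {
                    // push this out to connected clients (Comet etc.)
                    System.out.println("changed row id: " + n.getParameter());
                }
            }
        }
    }
}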
Another possible way to do this is if the database has an interface to create stored procedures or triggers that link to a dynamic library (i.e., a DLL or .so file). Then you could write the server signalling code in C or whatever.
On the same theme, some databases allow you to write stored procedures in languages such as Java, Ruby, Python and others. You might be able to use one of these (instead of something that compiles to a machine code DLL like C does) for the signalling mechanism.
Hope that gives you enough ideas to get started.
"I figure there must be some way, after all, a web application like Gmail seems to update my inbox almost immediately after a new email was sent to me. Surely my client isn't continually checking for updates all the time. I think the way they do this is with AJAX, but how AJAX can behave like a remote function call I don't know. I'd be curious to know how Gmail does this, but what I'd most like to know is how to do this in the general case with a database."
Take a peek with Wireshark sometime... there's some Google traffic going on there quite regularly, it appears.
Depending on your DB, triggers might help. An app I wrote relies on triggers but I use a polling mechanism to actually 'know' that something has changed. Unless you can communicate the change out of the DB, some polling mechanism is necessary, I would say.
Just my two cents.
Well, the best way is a database trigger. Whether you can use one depends on the ability of your DBMS, which you haven't specified, to support them.
Re your edit: the way applications like Gmail do it is, in fact, with AJAX polling. Install the Tamper Data Firefox extension to see it in action. The trick there is to keep your polling query blindingly fast in the "no news" case.
Unfortunately there's no way to push data to a web browser - you can only ever send data as a response to a request - that's just the way HTTP works.
AJAX is what you want to use, though: calling a web service once a second isn't excessive, provided you design the web service to ensure it receives a small amount of data, sends a small amount back, and can run very quickly to generate that response.
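To illustrate that "blindingly fast in the no-news case" point, here is a minimal polling endpoint sketched as a Java servlet; the version-counter scheme and the v parameter are invented for the example:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative cheap polling endpoint: the client sends the version it has
// already seen; the handler compares it against an in-memory counter and
// answers immediately with an empty 204 in the common "no news" case.
public class PollServlet extends HttpServlet {
    // bumped by whatever part of the app applies the database change
    static volatile long currentVersion = 0;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String v = req.getParameter("v");
        long clientVersion = (v == null) ? -1 : Long.parseLong(v);
        if (clientVersion >= currentVersion) {
            resp.setStatus(HttpServletResponse.SC_NO_CONTENT); // no news: tiny and fast
        } else {
            resp.setContentType("application/json");
            resp.getWriter().write("{\"version\":" + currentVersion + "}");
        }
    }
}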
