Apache Camel splitting large file

Apache Camel splitting large file - apache-camel

I have a Camel route that needs to split a large file (600k lines of id's) into 600k individual messages and then push them onto an Activemq queue. How do I optimize the route from the Camel side to increase throughput? I'm currently achieving ~150 messages/second throughput to AMQ. Here's what the route looks like currently. Any suggestions are appreciated!
from("file://directory")
.split().jsonpath("$.ids").streaming().parallelProcessing()
.log(LoggingLevel.INFO, "Split: ${body}")
.to("activemq:queue:myqueue");

First things first, as pointed out by #Bedla, Pool your connection (i.e. wrap your connection factory inside a org.apache.activemq.pool.PooledConnectionFactory).! It will most likely give you a throughput boost in the range of x10 to x100 depending on network conditions, message size etc. More for smaller messages.
Then, when hunting throughput, dumping each of the 600k lines into your log file will do no one no good. Remove it or at least put it to trace/debug level.
If your broker is located elsewhere, like other part of the world, or a place with bad network latency in general, consider using async dispatch on the ConnectionFactory setting. It will not wait for the roundtrip of a confirmation of each message sent.
Then finally, if none of the above give decent result (I think just a pool should do) turn off message persistence. The broker disk might be a bottle neck for low spec/old servers. There are even tweaks to enhance certain OS/storage combos for performance.

Related

I want to log all mqtt messages of the broker. How should I design schema of database. Avoiding dulplicate entries and fast searching

I am implementing a callback in java to store messages in a database. I have a client subscribing to '#'. But the problem is when this # client disconnects and reconnect it adds duplicate entries in the database of retained messages. If I search for previous entries bigger tables will be expensive in computing power. So should I allot a separate table for each sensor or per broker. I would really appreciate if you suggest me better designs.

Subscribing to wildcard with a single client is definitely an anti-pattern. The reasons for that are:
Wildcard subscribers get all messages of the MQTT broker. Most client libraries can't handle that load, especially not when transforming / persisting messages.
If you wildcard subscriber dies, you will lose messages (unless the broker queues endlessly for you, which also doesn't work)
You essentially have a single point of failure in your system. Use MQTT brokers which are hardened for production use. These are much more robust single point of failures than your hand-written clients. (You can overcome the SIP through clustering and load balancing, though).
So to solve the problem, I suggest the following:
Use a broker which can handle shared subscriptions (like HiveMQ or MessageSight), so you can balance all messages between many clients
Use a custom plugin for doing the persistence at the broker instead of the client.
You can also read more about that topic here: http://www.hivemq.com/blog/mqtt-sql-database

Also consider using QoS = 3 for all message to make sure one and only one message is delivered. Also you may consider time-stamp each message to avoid inserting duplicate messages if QoS requirement is not met.

Message Queue with Apache Storm

I'm incredibly new to Apache storm and the expansive options available with message queues. The current system reads in files from a data store (text, binary, anything) and passes them into Apache solr for indexing. However, additional processing needs to be done with these files, which is where storm comes in. During the UpdateRequestProcessorChain in storm, it appears that I can write the file being processed to a message broker, which i can then pull with storm to do some parallel real-time processing.
I am expecting an average of 10,000 requests per second at 4KB/message. However, there is a possibility (albeit very rare) of a 100GB+ file being passed in over several seconds. Is there a message queue that will still work well with those requirements?
I already looked into Kafka, which seems to be optimized for 1KB messages. RabbitMQ does not seem to like large files. ActiveMQ does seems to have blob messages for sending large files. Does anyone have experience with any of the above or others?

I don't think putting 100GB+ file in any message queue is a good idea. You can preprocess the file and break it into manageable chunks before putting it into message queue. You can add some kind of id to each chunk, so that you can relate different chunks of the file in Storm while processing. Also, it is also not a good idea to store a very large file as one document in Solr.

Resuming Camel Processing after power failure

I'm currently developing a Camel Integration app in which resumption from a previous state of processing is important. When there's a power outage, for instance, it's important that all previously processed messages are not re-processed. The processing should resume from where it left off before the outage.
I've gone through a number of possible solutions including Terracotta and Apache Shiro. I'm not sure how to use either as documentation on the integration with Apache Camel is scarce. I've not settled on the two, however.
I'm looking for suggestions on the potential alternatives I can use or a pointer to some tutorial to get me started.

The difficulty in surviving outages lies primarily in state, and what to do with in-flight messages.
Usually, when you're talking state within routes the solution is to flush it to disk, or other nodes in the cluster. Taking the aggregator pattern as an example, aggregated state is persisted in an aggregation repository. The default implementation is in memory, so if the power goes out, all the state is lost. However, there are other implementations, including one for JDBC, and another using Hazelcast (a lightweight in-memory data grid). I haven't used Hazelcast myself, but JDBC does a synchronous write to disk. The aggregator pattern allows you to resume from where you left off. A similar solution exists for idempotent consumption.
The second question, around in-flight messages is a little more complicated, and largely depends on where you are consuming from. If you're in the middle of handling a web service request, and the power goes out, does it matter if you have lost the message? The user can simply retry. Any effects on external systems can be wrapped in a transaction, or an idempotent consumer with JDBC idempotent repository.
If you are building out integrations based on messaging, you should consume within a transaction, so that if your server goes down, the messages go back into the broker and can be replayed to another consumer.
Be careful when using seda: or threads blocks, these use an in-memory queue to pass exchanges between threads, any messages flowing down these sorts of routes will be lost if someone trips over the power cable. If you can't afford message loss, and need this sort of processing model, consider using a JMS queue as the endpoints between the two routes (with transactions to ensure you pick up where you left off).

Akka Actors: Handling DB Failures Without Losing Data

Scenario
The DB for an application has gone down. This results in any actor responsible for committing important data to the DB failing to get a connection
Preferred Behaviour
The important data is written to the db when it comes back up sometime in the future.
Current Implementation
The actor catches the DBException, wraps the data in a DBWriteFailed case class, and sends the message to its supervisor. The supervisor then schedules another write for sometime in the future (e.g. 1 minute) using system.scheduler.scheduleOnce(...) so that we don't spin in circles too much while waiting for the DB to come back up.
This implementation certainly works but I feel there might be a better way.
The protocol gets a bit messier when the committing actor has to respond to the original sender after a successful commit.
The regular flow of messages to the committing actor is not throttled in any way and the actor will happily process the new messages, likely failing to connect to the DB for each and every one of them.
If messages get caught in this retry loop for too long, the mailboxes of the committing actors will start to balloon. It is important that this data be committed, but none of it matters if the application crawls to a halt or crashes due to excessive memory usage.
I am an akka novice and I am largely inexperienced when it comes to supervisor strategies, but I feel as though I may be able to leverage one of those to handle some of this retry logic.
Is there a common approach in akka for solving a problem like this? Am I on the right track or should I be heading in a different direction?
Any help is appreciated.

You can use Akka Circuit Breaker to reduce connection attempts. Instead of using the scheduler as retry queue I would use a buffer (with max size limit) inside the actor and retry those when circuit breaker becomes closed again (onClose callback should send message to self actor). An alternative could be to combine the circuit breaker with a stashing mailbox.

If you're planning to implement full failover in your app
Don't.
Do not bubble database failover responsibility up into the app layer. As far as your app is concerned, the database should just be up and ready to accept reads and writes.
If your database goes down often, spend time making your database more robust (there's a multitude of resources on the web already for this: search the web for terms like 'replication', 'high availability', 'load-balancing' and 'clustering', and learn from the war stories of others at highscalability.com). It all really depends on what the cause of your DB outages are (e.g. I once maxed out the NIC on the DB master, and "fixed" the problem intermittently by enabling GZIP on the wire).
You'll be glad you adhered to a separation of concerns if you go down this route.
If you're planning to implement the odd sprinkling of retry logic and handling DB brown-outs
If you're not expecting your app to become a replacement database, then Patrik's answer is the best way to go.

Why is it bad practice to make multiple database connections in one request?

A discussion about Singletons in PHP has me thinking about this issue more and more. Most people instruct that you shouldn't make a bunch of DB connections in one request, and I'm just curious as to what your reasoning is. My first thought is the expense to your script of making that many requests to the DB, but then I counter myself with the question: wouldn't multiple connections make concurrent querying more efficient?
How about some answers (with evidence, folks) from some people in the know?

Database connections are a limited resource. Some DBs have a very low connection limit, and wasting connections is a major problem. By consuming many connections, you may be blocking others for using the database.
Additionally, throwing a ton of extra connections at the DB doesn't help anything unless there are resources on the DB server sitting idle. If you've got 8 cores and only one is being used to satisfy a query, then sure, making another connection might help. More likely, though, you are already using all the available cores. You're also likely hitting the same harddrive for every DB request, and adding additional lock contention.
If your DB has anything resembling high utilization, adding extra connections won't help. That'd be like spawning extra threads in an application with the blind hope that the extra concurrency will make processing faster. It might in some certain circumstances, but in other cases it'll just slow you down as you thrash the hard drive, waste time task-switching, and introduce synchronization overhead.

It is the cost of setting up the connection, transferring the data and then tearing it down. It will eat up your performance.
Evidence is harder to come by but consider the following...
Let's say it takes x microseconds to make a connection.
Now you want to make several requests and get data back and forth. Let's say that the difference in transport time is negligable between one connection and many (just ofr the sake of argument).
Now let's say it takes y microseconds to close the connection.
Opening one connection will take x+y microseconds of overhead. Opening many will take n * (x+y). That will delay your execution.

Setting up a DB connection is usually quite heavy. A lot of things are going on backstage (DNS resolution/TCP connection/Handshake/Authentication/Actual Query).
I've had an issue once with some weird DNS configuration that made every TCP connection took a few seconds before going up. My login procedure (because of a complex architecture) took 3 different DB connections to complete. With that issue, it was taking forever to log-in. We then refactored the code to make it go through one connection only.

We access Informix from .NET and use multiple connections. Unless we're starting a transaction on each connection, it often is handled in the connection pool. I know that's very brand-specific, but most(?) database systems' cilent access will pool connections to the best of its ability.
As an aside, we did have a problem with connection count because of cross-database connections. Informix supports synonyms, so we synonymed the common offenders and the multiple connections were handled server-side, saving a lot in transfer time, connection creation overhead, and (the real crux of our situtation) license fees.

I would assume that it is because your requests are not being sent asynchronously, since your requests are done iteratively on the server, blocking each time, you have to pay for the overhead of creating a connection each time, when you only have to do it once...
In Flex, all web service calls are automatically called asynchronously, so you it is common to see multiple connections, or queued up requests on the same connection.
Asynchronous requests mitigate the connection cost through faster request / response time...because you cannot easily achieve this in PHP without out some threading, then the performance hit is greater then simply reusing the same connection.
that's my 2 cents...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight