Message Queue with Apache Storm - solr

I'm incredibly new to Apache Storm and the expansive options available with message queues. The current system reads in files from a data store (text, binary, anything) and passes them into Apache Solr for indexing. However, additional processing needs to be done with these files, which is where Storm comes in. During the UpdateRequestProcessorChain in Solr, it appears that I can write the file being processed to a message broker, which I can then pull from with Storm to do some parallel real-time processing.
I am expecting an average of 10,000 requests per second at 4KB/message. However, there is a possibility (albeit very rare) of a 100GB+ file being passed in over several seconds. Is there a message queue that will still work well with those requirements?
I already looked into Kafka, which seems to be optimized for 1KB messages. RabbitMQ does not seem to like large files. ActiveMQ does seem to have blob messages for sending large files. Does anyone have experience with any of the above, or with others?

I don't think putting a 100GB+ file in any message queue is a good idea. You can preprocess the file and break it into manageable chunks before putting it into the message queue. You can add some kind of id to each chunk so that you can relate the different chunks of the file in Storm while processing. It is also not a good idea to store a very large file as one document in Solr.
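As a rough sketch of that chunking idea, assuming Kafka as the broker purely for concreteness (the topic name, broker address, and chunk size below are placeholders), each chunk carries a file id plus a sequence number so a Storm bolt can correlate and, if needed, reorder the pieces:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;

public class FileChunkPublisher {

    private static final int CHUNK_SIZE = 512 * 1024; // 512 KB per chunk; tune to your broker's limits

    public static void publish(Path file, KafkaProducer<String, byte[]> producer, String topic)
            throws IOException {
        String fileId = UUID.randomUUID().toString(); // one correlation id per file
        long sequence = 0;

        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[CHUNK_SIZE];
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Key = "<fileId>:<sequence>" so a downstream bolt can group and reorder chunks
                String key = fileId + ":" + sequence++;
                producer.send(new ProducerRecord<>(topic, key, Arrays.copyOf(buffer, read)));
            }
        }
        producer.flush();
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            publish(Paths.get(args[0]), producer, "file-chunks"); // placeholder topic name
        }
    }
}

Once the payloads are chunk-sized like this, any of the brokers you mention should cope; the id scheme is what lets Storm stitch the file back together.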

Related

Apache Camel splitting large file

I have a Camel route that needs to split a large file (600k lines of IDs) into 600k individual messages and then push them onto an ActiveMQ queue. How do I optimize the route from the Camel side to increase throughput? I'm currently achieving ~150 messages/second throughput to AMQ. Here's what the route looks like currently. Any suggestions are appreciated!
from("file://directory")
.split().jsonpath("$.ids").streaming().parallelProcessing()
.log(LoggingLevel.INFO, "Split: ${body}")
.to("activemq:queue:myqueue");
First things first, as pointed out by @Bedla: pool your connections (i.e. wrap your connection factory inside an org.apache.activemq.pool.PooledConnectionFactory)! It will most likely give you a throughput boost in the range of 10x to 100x depending on network conditions, message size, etc., with the biggest gains for smaller messages.
Then, when hunting throughput, dumping each of the 600k lines into your log file will do no one any good. Remove the log step, or at least drop it to trace/debug level.
If your broker is located elsewhere, such as another part of the world or a place with bad network latency in general, consider enabling async dispatch on the ConnectionFactory. The producer will then not wait for the round trip of a confirmation for each message sent.
Finally, if none of the above gives a decent result (I think the pool alone should do it), turn off message persistence. The broker disk might be a bottleneck on low-spec or old servers. There are even tweaks to enhance certain OS/storage combinations for performance.
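Putting the first three suggestions together, here is a minimal sketch of the connection setup, assuming the activemq-pool and activemq-camel dependencies are on the classpath; the broker URL and pool size are placeholders:

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.camel.component.ActiveMQComponent;
import org.apache.activemq.pool.PooledConnectionFactory;
import org.apache.camel.CamelContext;
import org.apache.camel.impl.DefaultCamelContext;

public class PooledAmqSetup {

    public static CamelContext configure() {
        ActiveMQConnectionFactory amqFactory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL
        // Don't wait for a broker round trip per message (trades some safety for throughput)
        amqFactory.setUseAsyncSend(true);

        // Reuse connections and sessions instead of opening one per exchange
        PooledConnectionFactory pooled = new PooledConnectionFactory();
        pooled.setConnectionFactory(amqFactory);
        pooled.setMaxConnections(8); // placeholder pool size

        ActiveMQComponent activemq = new ActiveMQComponent();
        activemq.setConnectionFactory(pooled);

        CamelContext context = new DefaultCamelContext();
        context.addComponent("activemq", activemq); // backs the activemq:queue:myqueue endpoint
        return context;
    }
}

Keep in mind that async sends trade some delivery guarantees for throughput, so leave persistence on if you cannot afford to lose messages on a broker failure.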

Database for large data files and streaming

I have a "database choice" and arhitecture question.
Use-case:
Clients will upload large .json files (or another format like .tsv, it is irrelevant) where each line is data about their customers (e.g. name, address, etc.)
We need to stream this data later on to process it and store the results, which will also be a large file where each line is data about a customer (approximately the same as the uploaded file).
My requirements:
Streaming should be as fast as possible (e.g. > 1000 rows per second), and we could have multiple processes running in parallel (for multiple clients)
The database should be scalable and fault tolerant. Because many GB of data could easily be uploaded, it should be easy for me to automatically add new commodity instances (using AWS) if storage gets low.
The database should have some kind of replication because we don't want to lose data.
No index is required since we are just streaming data.
What would you suggest as a database for this problem? We tried uploading it to Amazon S3 and letting it take care of scaling etc., but there is a problem with slow reads/streaming.
Thanks,
Ivan
Initially uploading the files to S3 is fine, but then pick them up and push each line to Kinesis (or MSK, or even Kafka on EC2 if you prefer), as in the sketch after the list below. From there, you can hook up the stream-processing framework of your choice (Flink, Spark Streaming, Samza, Kafka Streams, the Kinesis KCL) to do transformations and enrichment, and finally pipe the results into a storage stack that allows streaming appends. A few obvious candidates:
HBase
Druid
Keyspaces for Cassandra
Hudi (or maybe LakeFS?) on top of S3
Which one you choose mostly comes down to your downstream needs in terms of query flexibility, latency, integration options/standards, etc.
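For the ingestion leg, a minimal sketch assuming the AWS SDK for Java v2; the bucket, object key, and stream name are placeholders:

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class S3ToKinesis {

    public static void relay(String bucket, String key, String streamName) throws Exception {
        try (S3Client s3 = S3Client.create();
             KinesisClient kinesis = KinesisClient.create();
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build()),
                     StandardCharsets.UTF_8))) {

            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                // Partition by line number here; in practice use something like a customer id
                kinesis.putRecord(PutRecordRequest.builder()
                        .streamName(streamName)
                        .partitionKey(Long.toString(lineNo++))
                        .data(SdkBytes.fromUtf8String(line))
                        .build());
            }
        }
    }
}

For real throughput you would batch with PutRecords (up to 500 records per call) rather than single puts, and pick a partition key with enough cardinality to spread load across shards.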

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: as for the streaming APIs, I'd only say that they're brilliant and over the top.
Batch APIs: the processing graphs are very powerful, optimised, and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others and optimising very complex DAGs that share common processing steps perfectly.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning the batch jobs that use input on HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big-data partitioned storage on HDFS and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in plain M/R or Spark, is always planned with the 'prefer local processing' idiom, so that data is processed by the same node that keeps the data blocks, to be faster and to avoid data transfer over the network.
In our tests with a 3-node cluster, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel,
but without the data-locality pattern.
Please share your opinion; I hope I just missed something, or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach to local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled to a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These source tasks are started on some node in the cluster and request input splits from the master (JobManager). For input splits of files in HDFS, the JobManager assigns the splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also occur if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
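To make the parallelism point concrete, here is a minimal DataSet API sketch where the source parallelism is pinned to the number of nodes in your 3-node test cluster; the HDFS paths are placeholders:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LocalityTestJob {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // With source parallelism roughly equal to the number of DataNodes (3 here),
        // each source task is more likely to be offered local splits first.
        DataSet<String> lines = env
                .readTextFile("hdfs:///data/events")   // placeholder input path
                .setParallelism(3);

        lines.filter(line -> line.contains("FAILURE")) // the "read and filter" step from the question
             .writeAsText("hdfs:///data/failures");    // placeholder output path
        env.execute("locality test");
    }
}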
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.

How to create large number of entities in Cloud Datastore

My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files, and combined, the number of entities can be around 50k. I tried the following:
1. Read a CSV file line by line and create an entity in the Datastore.
Issues: It works well, but it timed out and could not create all the entities in one go.
2. Uploaded all files to Blobstore and read them into the Datastore.
Issues: I tried a Mapper function to read the CSV files uploaded to Blobstore and create entities in the Datastore. The issue I have is that the mapper does not work if the file size grows larger than 2MB. I also simply tried reading the files in a servlet, but again hit the timeout issue.
I am looking for a way to create the above large number of entities (50k+) in the Datastore all in one go.
Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.
It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60 second deadline for task completion.
Task Queues
I would suggest you look into using Task Queues: when you upload a CSV that needs processing, push it into a queue for background processing.
When working with Task Queues, the tasks themselves still have a deadline, but one that is larger than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError so that you can save where you are up to in a CSV and resume from that position when the task is retried.
Caveat on catching DeadlineExceededError
Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.
If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:
Switch to a manually scaled instance, which gives you a 24-hour deadline.
Ensure your task saves progress and returns an error well before the 10 min deadline so that it can be resumed correctly without having to catch the error, as in the sketch below.
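As a rough illustration of the second option, here is a sketch using the App Engine Task Queue API; the worker URL, batch size, and helper methods are hypothetical:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class CsvImportHelper {

    private static final int BATCH_SIZE = 500; // hypothetical rows per task run; tune to stay under the deadline

    /** Enqueue processing of a CSV stored in Blobstore, starting at the given row. */
    public static void enqueue(String blobKey, long startRow) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/import-csv")              // hypothetical worker servlet path
                .param("blobKey", blobKey)
                .param("startRow", Long.toString(startRow)));
    }

    /** Called by the worker servlet: process one batch, then hand off the remainder. */
    public static void processBatch(String blobKey, long startRow) {
        // ... read rows [startRow, startRow + BATCH_SIZE) from the blob and create entities ...
        long nextRow = startRow + BATCH_SIZE;

        if (moreRowsRemain(blobKey, nextRow)) {
            // Re-enqueue well before the 10-minute deadline instead of catching DeadlineExceededError
            enqueue(blobKey, nextRow);
        }
    }

    private static boolean moreRowsRemain(String blobKey, long row) {
        // ... compare against the blob's total row count (omitted) ...
        return false;
    }
}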

BizTalk 2013 Start Message Processing Before Source File Finishes?

We've got a large and complicated file that takes a long time to disassemble (say, an hour). It would be great if we could spin off messages as they leave the receive pipeline and start them on their itinerary immediately, before the file is finished. I can tell that it's not easy, but is it possible at all?
Not out of the box. Pipeline disassembly is transactional, so, as you observe, the entire interchange is debatched and committed to the MessageBox at once.
Here are some options:
If you are receiving a flat file where each row is a message, use SSIS to load it into a table, then use the SQL Adapter to drain the messages by polling out ~10 at a time.
If you are receiving a complex flat file or XML, you can wrap either XmlDasm or FFDasm in a custom disassembler component, but instead of returning the debatched messages to the MessageBox, push them somewhere else. A) The file system is easy if order is not required. B) MSMQ will maintain the order of the messages as they appear in the file.
I have used both of these where the incoming files had 100k to 400k records, and it does provide a more manageable performance profile.
