Spark: run InputFormat as singleton - database

I'm trying to integrate a key-value database with Spark and have some questions.
I'm a Spark beginner, have read a lot and run some samples but nothing too
complex.
Scenario:
I'm using a small HDFS cluster to store incoming messages in a database.
The cluster has 5 nodes, and the data is split into 5 partitions. Each
partition is stored in a separate database file. Each node can therefore process
its own partition of the data.
The Problem:
The interface to the database software is based on JNI, the database itself is
implemented in C. For technical reasons, the database software can maintain
only one active connection at a time. There can be only one JVM process which
is connected to the database.
Because of this limitation, reading from and writing to the database must go
through the same JVM process.
(Background info: the database is embedded into the process. It's file based,
and only one process can open it at a time. I could let it run in a separate
process, but that would be slower because of the IPC overhead. My application
will perform many full table scans. Additional writes will be batched and are
not time-critical.)
The Solution:
I have a few ideas in mind for solving this, but I don't know if they work
well with Spark.
1. Maybe it's possible to magically configure Spark to only have one instance of my
   proprietary InputFormat per node.
2. If my InputFormat is used for the first time, it starts a separate thread
   which will create the database connection. This thread will then continue
   as a daemon and will live as long as the JVM lives. This will only work
   if there's just one JVM per node. If Spark starts multiple JVMs on the
   same node then each would start its own database thread, which would not
   work.
3. Move my database connection to a separate JVM process per node, and my
   InputFormat then uses IPC to connect to this process. As I said, I'd like to avoid this.
4. Or maybe you have another, better idea?
My favourite solution would be #1, followed closely by #2.
Thanks for any comment and answer!
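
The second idea (a daemon thread that owns the one connection for the lifetime of the process) can be sketched roughly as follows. This is written in Python only to keep it short; on the JVM the same shape would be a static holder, which exists once per JVM. `DatabaseOwner` and `open_fake_db` are hypothetical names, the "database" is just a dict, and the sketch does nothing about the case where Spark starts several JVMs on one node.

```python
import queue
import threading


class DatabaseOwner:
    """Process-wide singleton: one daemon thread owns the only connection."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, open_connection):
        self._requests = queue.Queue()
        self._open_connection = open_connection
        threading.Thread(target=self._run, daemon=True).start()

    @classmethod
    def get(cls, open_connection):
        # Creation is guarded by a lock so concurrent first callers share one owner.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(open_connection)
            return cls._instance

    def _run(self):
        conn = self._open_connection()  # opened exactly once, on this thread
        while True:
            func, reply = self._requests.get()
            try:
                reply.put((True, func(conn)))
            except Exception as exc:  # hand errors back to the caller
                reply.put((False, exc))

    def submit(self, func):
        """Run func(conn) on the owner thread and return its result."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((func, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value


# Hypothetical stand-in for the JNI database: a dict plus an "open" counter.
open_count = []


def open_fake_db():
    open_count.append(1)
    return {"k1": "v1", "k2": "v2"}


a = DatabaseOwner.get(open_fake_db)
b = DatabaseOwner.get(open_fake_db)
scan = a.submit(lambda db: sorted(db.items()))   # a "full table scan"
b.submit(lambda db: db.update(k3="v3"))          # a write through the same owner
size = a.submit(len)
```

Every read and write is funnelled through `submit`, so callers never touch the connection directly, which is the property the question needs.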

I believe the best option here is to connect to your DB from the driver, not from the executors. This part of the system would be a bottleneck anyway.

Have you thought of queueing (buffering) the messages, then using Spark Streaming to dequeue them and your output format to write them?

If the data from your DB fits into the RAM of your Spark driver, you can load it there as a collection and then parallelize it into an RDD: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#parallelized-collections

Related

Data/event exchange between jobs

Is it possible in Apache Flink to create an application that consists of multiple jobs which together build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this, and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and can point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, where the upstream job generates data faster than the downstream job can consume it.
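
To make the buffering point concrete, here is a small back-of-the-envelope helper. The function name and all the figures are made up for illustration; the only claim is the arithmetic: a bounded buffer absorbs a rate mismatch for capacity / (produce rate - consume rate) seconds and no longer.

```python
def seconds_until_full(capacity_msgs, produce_rate, consume_rate):
    """Time until a bounded buffer fills, or None if it never fills."""
    surplus = produce_rate - consume_rate  # net growth in messages per second
    if surplus <= 0:
        return None  # the consumer keeps up, so the buffer never fills
    return capacity_msgs / surplus


# Illustrative numbers only: 1M buffered messages, upstream 1000 msg/s faster.
burst = seconds_until_full(1_000_000, 5_000, 4_000)
steady = seconds_until_full(1_000_000, 4_000, 4_000)
```

With these numbers the buffer is gone after `burst` seconds, so a persistent mismatch only gets postponed, not solved.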
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, JavaScript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
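
Separately from that talk, the core idea (static topology, logic swapped at runtime through a control stream) can be shown without Flink at all. The sketch below is plain Python, not the Flink API; the `("rule", ...)` / `("data", ...)` event shapes and the tiny `gt`/`lt` rule vocabulary are assumptions made up for the example.

```python
import operator

# Hypothetical rule vocabulary; a real job might ship a DSL, a PMML model or a script.
OPS = {"gt": operator.gt, "lt": operator.lt}


def run_pipeline(events):
    """Process one merged stream of ('rule', ...) and ('data', ...) events."""
    rule = None  # plays the role of the broadcast state
    output = []
    for kind, payload in events:
        if kind == "rule":
            rule = payload  # logic is replaced, the pipeline itself is unchanged
        elif rule is not None:
            op, threshold = rule
            if OPS[op](payload, threshold):
                output.append(payload)
    return output


events = [
    ("rule", ("gt", 10)),
    ("data", 5), ("data", 12),
    ("rule", ("lt", 10)),  # logic updated while the "job" keeps running
    ("data", 5), ("data", 12),
]
result = run_pipeline(events)
```

The same data values are filtered differently before and after the second rule arrives, without restarting anything.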

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's (where executors dynamically take and release virtual cores in YARN), the main point of the question is as follows.
Flink seems just amazing: for the streaming APIs, I'd only say that it's brilliant and over the top.
Batch APIs: processing graphs are very powerful and are optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, and perfectly optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that use input on HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big-data partitioned storage on HDFS and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in simple M/R or Spark, is always planned with the idiom of 'prefer local processing', so that data is processed by the same node that keeps the data blocks, to be faster and to avoid data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel.
But w/o the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach for local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled on a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). For input splits of files in HDFS, the JobManager assigns the splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also occur if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
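
A toy model of that behaviour, to show why low parallelism leads to remote reads. This is not Flink's actual assigner, just a simplified simulation under stated assumptions: each split lives on exactly one node (no HDFS replication), tasks take turns requesting splits, and the names `assign_splits`, `node1` etc. are made up.

```python
from collections import Counter


def assign_splits(split_hosts, task_hosts):
    """Hand out splits to fixed source tasks, preferring splits local to the task."""
    remaining = list(split_hosts)
    counts = Counter()
    while remaining:
        for host in task_hosts:  # tasks take turns requesting one split
            if not remaining:
                break
            if host in remaining:
                remaining.remove(host)  # a split stored on the task's own node
                counts["local"] += 1
            else:
                remaining.pop(0)  # no local split left: read one remotely
                counts["remote"] += 1
    return dict(counts)


splits = ["node1", "node2", "node3"] * 2  # 6 blocks spread over 3 nodes
full = assign_splits(splits, ["node1", "node2", "node3"])  # parallelism 3
narrow = assign_splits(splits, ["node1"])                  # parallelism 1
```

With one source task per node every split is read locally; with a single task, only the two splits on its own node are local and the remaining four are fetched over the network.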
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.

MPI how to send and receive SQLite database

I have a big SQLite database to process, so I would like to use MPI for parallelization to speed things up. What I want to do is send the database from the root to every slave, and send the modified databases back to the root after each slave adds some tables to it. I want to use MPI_Type_create_struct to create a datatype to store the database, but the database is too complicated. Is there any other way to handle this situation? Thank you in advance!
I recently dealt with a similar problem - I have a large MPI application that uses SQLite as a configuration store. Handling multi-process writes is a challenge with an embedded SQL database. My experience with this involves a massively parallel application (running up to 65,535 ranks) with a shared filesystem.
Based on the FAQ from SQLite and some experience with database engines, there are a few ways to approach this problem. I am making the assumption that you are operating with a shared distributed file system, and multiple separate computers (a standard HPC cluster setup).
Since SQLite will block when multiple processes write to the database (but not read), reads will most likely not be an issue. Each process can run multiple SELECT commands at the same time without issue.
The challenge will be in the writing. Disk I/O is several orders of magnitude slower than computation, so generally this will be the bottleneck. Having said that, network communication may also be a significant slowdown, so how you approach the problem really depends on where the weakest link of your running environment will be.
If you have a fast network and slow disk speed, or if you want to implement this in the most straightforward way possible, your best bet is to have a single MPI rank in charge of writing to the database. Your compute processes would independently run SELECT commands until computation was complete, then send the new data to the MPI database process. The database control process would then write the new data to disk. I would not try to send the structure of the database across the network; rather, I would send the data that should be written, along with (possibly) a flag that identifies which table/insert query the data should be written with. This technique is somewhat similar to how an RDBMS works: while RDBMS servers do support concurrent writes, there is a "central" process in control of the ordering of write operations.
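
A rough sketch of that single-writer layout. To keep it runnable without an MPI installation, threads and a `queue.Queue` stand in for ranks and for MPI send/receive, which is an assumption of this example and not how the real application would be launched. The table names, the `INSERTS` tag map and the `STOP` sentinel are likewise made up.

```python
import os
import queue
import sqlite3
import tempfile
import threading

# Tag -> INSERT statement; the tag travels with the data instead of any schema.
INSERTS = {
    "result": "INSERT INTO results(worker, value) VALUES (?, ?)",
    "log": "INSERT INTO logs(worker, message) VALUES (?, ?)",
}
STOP = None  # sentinel a worker sends when it has finished


def writer(db_path, inbox, n_workers):
    """The only code path that ever writes to the database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results(worker INTEGER, value INTEGER)")
    conn.execute("CREATE TABLE IF NOT EXISTS logs(worker INTEGER, message TEXT)")
    finished = 0
    while finished < n_workers:
        msg = inbox.get()  # stands in for a receive on the writer rank
        if msg is STOP:
            finished += 1
            continue
        tag, row = msg
        conn.execute(INSERTS[tag], row)
    conn.commit()
    conn.close()


def worker(rank, inbox):
    for i in range(3):
        inbox.put(("result", (rank, rank * 10 + i)))  # stands in for a send
    inbox.put(("log", (rank, "done")))
    inbox.put(STOP)


db_path = os.path.join(tempfile.mkdtemp(), "shared.db")
inbox = queue.Queue()
workers = [threading.Thread(target=worker, args=(r, inbox)) for r in range(1, 4)]
writer_thread = threading.Thread(target=writer, args=(db_path, inbox, len(workers)))
for t in workers + [writer_thread]:
    t.start()
for t in workers + [writer_thread]:
    t.join()

check = sqlite3.connect(db_path)
n_results = check.execute("SELECT COUNT(*) FROM results").fetchone()[0]
n_logs = check.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
check.close()
```

Only plain tuples cross the "network" here, so there is no need to describe the database itself as an MPI datatype.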
One thing to note is that if a process writes to the SQLite database, the file is locked for all processes that are trying to read or write to it. You will need to either handle the SQLITE_BUSY return code in your worker processes, register a callback to handle this, change the busy behavior, or use an alternate technique. In my application, I found that loading the database as an in-memory database (https://www.sqlite.org/inmemorydb.html) for the readers provided a good workaround. Readers access the in-memory database, but send results to the controlling process for writes. The downside is that you will have multiple copies of the database in memory.
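
Both of those (changing the busy behavior, and giving each reader an in-memory copy) are available from Python's standard `sqlite3` module, so a minimal sketch could look like this. The file name, the `settings` table and the 30-second timeout are assumptions for the example; `Connection.backup` needs Python 3.7 or later.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "config.db")

# Build a small on-disk database to play the role of the shared file.
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE settings(key TEXT PRIMARY KEY, value TEXT)")
disk.executemany(
    "INSERT INTO settings VALUES (?, ?)", [("grid", "128"), ("steps", "1000")]
)
disk.commit()
disk.close()


def load_into_memory(db_path, busy_timeout_s=30.0):
    """Copy an on-disk database into a private in-memory one."""
    # timeout= makes the connection wait on a lock instead of failing right away.
    source = sqlite3.connect(db_path, timeout=busy_timeout_s)
    memory = sqlite3.connect(":memory:")
    source.backup(memory)  # page-by-page copy of the whole database
    source.close()
    return memory


reader = load_into_memory(path)
os.remove(path)  # the reader no longer depends on the file at all
settings = dict(reader.execute("SELECT key, value FROM settings"))
```

After the copy, reads never contend with the writer for the file lock, at the cost of one full copy of the data per reader.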
Another option that might be less network intensive is to do the reads concurrently and have each worker process write out to its own file. You could write out to separate SQLite database files, or even export something like CSV (depending on the complexity of the data). When the writes are complete, you would then have a single process merge the individual files into a single result database file (see "How can I merge many SQLite databases?"). This method has its own issues, but depending on where your bottlenecks are and how the system as a whole is laid out, this technique may work.
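
One common way to do the merge step is SQLite's `ATTACH DATABASE`. The sketch below assumes every worker file has the same single `results` table; the file names and the schema are invented for the example.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()


def write_worker_db(rank):
    """Each worker writes its results to its own database file."""
    path = os.path.join(workdir, f"worker_{rank}.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE results(worker INTEGER, value INTEGER)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?)",
        [(rank, rank * 100 + i) for i in range(2)],
    )
    conn.commit()
    conn.close()
    return path


def merge(paths, merged_path):
    """Single process: pull every worker file into one result database."""
    merged = sqlite3.connect(merged_path)
    merged.execute("CREATE TABLE results(worker INTEGER, value INTEGER)")
    for path in paths:
        merged.execute("ATTACH DATABASE ? AS part", (path,))
        merged.execute("INSERT INTO results SELECT worker, value FROM part.results")
        merged.commit()  # finish the transaction before detaching
        merged.execute("DETACH DATABASE part")
    return merged


paths = [write_worker_db(rank) for rank in (1, 2, 3)]
merged = merge(paths, os.path.join(workdir, "merged.db"))
values = [v for (v,) in merged.execute("SELECT value FROM results ORDER BY value")]
merged.close()
```

If the worker tables can contain overlapping keys, the `INSERT ... SELECT` would need a conflict strategy, which is one of the "own issues" this approach brings.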
Finally, you might consider reading from the SQLite database and saving the data to a proper distributed file format, such as HDF5 (or using MPI IO). Once the computation is done, it would be pretty straightforward to write a script that would create a new SQLite database from this foreign file format.

Multiple Biztalk host instances writing to single file

We have four BizTalk servers in our production environment. The send port is configured to write incoming messages to one text file. This port receives thousands of messages a day, so multiple host instances try to write to the file at the same time: before one instance finishes writing a complete record, another instance starts writing a new record, leaving data scattered all over the file.
What can we do to resolve this issue?
The easy way is to only use a single Host Instance to write data to the file, however you may then start to experience throttling issues. Alternatively, you could explore using the 'Allow Cache on write' option on the File Adapter which may offer some improvements.
However, I think your approach is wrong. You cannot expect four separate and totally disconnected processes (across 4 servers, no less) to reliably append to a single file IN ORDER.
I therefore think you should look at re-architecting this solution:
1. As each message is received, write the contents of the message to a database table (a simple INSERT) with an 'unprocessed' flag. You can reliably have four Host Instances banging data into SQL without fear of them tripping over each other.
2. At a scheduled time, have BizTalk extract all of the records that are marked as unprocessed in that SQL table (the WCF-SQL Adapter can help you here). Once you have polled the records, mark them as 'in-process'.
3. You should now have a single message containing all of the currently unprocessed records (as retrieved from SQL). Using a single (or multiple) Host Instance/s, write the message to disk, appending each of the records to the file in a single write. The key here is that you are only writing a single message to the one file, not lots and lots and lots :-)
4. If the write is successful, update each of the records in the SQL table with a 'processed' flag so they are not picked up again on the next poll.
You might want to consider a singleton orchestration for this piece to ensure that there is only ever one poll-write-update process taking place at the same time.
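
The poll-write-update cycle itself is independent of BizTalk, so here is a rough sketch of it with SQLite standing in for SQL Server and a list standing in for the output file. All names (`staging`, `receive`, `poll_write_update`, `writes`) are invented for the example, and it assumes only one poller runs at a time, as the singleton suggestion implies.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE staging("
    "id INTEGER PRIMARY KEY, body TEXT, status TEXT DEFAULT 'unprocessed')"
)

writes = []  # stands in for the output file: one entry per physical write


def receive(body):
    """Step 1: any number of receivers can insert concurrently."""
    db.execute("INSERT INTO staging(body) VALUES (?)", (body,))
    db.commit()


def poll_write_update():
    """Steps 2 to 4: claim the batch, write it once, then flag it as done."""
    db.execute("UPDATE staging SET status = 'in-process' WHERE status = 'unprocessed'")
    rows = db.execute(
        "SELECT id, body FROM staging WHERE status = 'in-process' ORDER BY id"
    ).fetchall()
    if rows:
        writes.append("".join(body + "\n" for _, body in rows))  # single append
        db.execute("UPDATE staging SET status = 'processed' WHERE status = 'in-process'")
    db.commit()
    return len(rows)


for body in ("msg-1", "msg-2", "msg-3"):
    receive(body)
first = poll_write_update()
receive("msg-4")
second = poll_write_update()
third = poll_write_update()
```

Three messages end up in one write, the late message in a second one, and an empty poll writes nothing.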
If FIFO is important, BizTalk has an ordered delivery mechanism (supported by the FILE adapter), but it comes at a performance cost.
A better solution would be to let the instances write to individual files and then have another scheduled process (or orchestration) combine them into one file. You can enforce FIFO using timestamps. This would provide better performance and resource utilization than the singleton orchestration mentioned earlier. Another option may be to use any suitable implementation of a queue.
You can move to a database system instead of a file. That would be a very simple solution and also very efficient.
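
The combine step with timestamps is a k-way merge. A minimal sketch, assuming each instance's file is already in timestamp order, that records can be represented as `(timestamp, body)` pairs, and that the servers' clocks are synchronized closely enough for the timestamps to be comparable:

```python
import heapq


def merge_by_timestamp(*instance_files):
    """Merge per-instance record lists, each already sorted by timestamp."""
    return list(heapq.merge(*instance_files, key=lambda rec: rec[0]))


# Made-up records from two host instances: (timestamp, body).
host_a = [(1, "a1"), (4, "a2")]
host_b = [(2, "b1"), (3, "b2"), (5, "b3")]
combined = [body for _, body in merge_by_timestamp(host_a, host_b)]
```

Because `heapq.merge` consumes its inputs lazily, the same approach works on file iterators without loading every file into memory.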
If you don't want to go that way, you must implement file locking or a semaphore inside of your application so the new threads will wait for other threads to finish writing.

Fastest way to read from database table in cluster environment

What would be the best approach to reading from a very big database table in a clustered environment?
Let's say we need to read a huge DB table as fast as we can and then send this data to a JMS queue. We would like to avoid reading the same data twice, since it requires processing, so preferably no overlaps. The application will be deployed in a JBoss cluster, so the nodes must somehow communicate.
In the one-node case (a non-clustered environment) I can just have one process reading the table.
In the two-node case the reading must somehow be coordinated so that both nodes don't read the same data. The same goes for three nodes, etc.
The number of nodes in the target environment is not known in advance. Nodes can communicate using a DB table or JBoss Cache.
It is clear that reading in blocks or pages per process will give maximum performance.
That would be an easy task in a simple Java multithreading environment, since we know how many threads will be reading, and it is easy math to divide the table into pages and assign each page to a single thread.
But when the number of nodes is unknown, there has to be some protocol for the nodes to communicate and optimize the reading.
Since you have to keep huge DB data distributed, I'd suggest taking a look at some kind of distributed hash table. I used GemFire in an enterprise project with the same requirements, and it's well proven. But there is always a limit on the maximum number of DB connections, so you can't scale without limit.
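
Staying with the "communicate using a DB table" option from the question, one simple protocol is a shared cursor row from which each node atomically claims the next page, so the number of nodes never has to be known. The sketch below is an illustration only: SQLite stands in for the real database, a single connection plays both nodes in turn, and `read_cursor`, `claim_page` and `TOTAL_PAGES` are invented names. On a server database the claim would typically rely on row locking instead of `BEGIN IMMEDIATE`.

```python
import sqlite3

# isolation_level=None: transactions are controlled explicitly below.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE read_cursor(next_page INTEGER NOT NULL)")
db.execute("INSERT INTO read_cursor VALUES (0)")

TOTAL_PAGES = 5  # assumed to be known, e.g. row count divided by page size


def claim_page(conn):
    """Atomically claim the next unread page, or None when the table is done."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    (page,) = conn.execute("SELECT next_page FROM read_cursor").fetchone()
    if page >= TOTAL_PAGES:
        conn.execute("ROLLBACK")
        return None
    conn.execute("UPDATE read_cursor SET next_page = next_page + 1")
    conn.execute("COMMIT")
    return page


# Two "nodes" alternate; each would read its claimed page and push it to JMS.
claims = {"node-a": [], "node-b": []}
done = False
while not done:
    for node in claims:
        page = claim_page(db)
        if page is None:
            done = True
            break
        claims[node].append(page)
```

Every page is claimed exactly once, and a node that joins late simply starts claiming from wherever the cursor currently is.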

Resources