Manage multiple clusters in Hadoop OR Distributed Computing Framework - database

I have five computers networked together. One of them is the master and the other four are slaves.
Each slave computer has its own set of data (a very big integer matrix). I want to run four different clustering programs on the four slaves, then bring the results back to the master for further processing (such as visualization).
I initially thought of using Hadoop, but I cannot find a good way to map this problem (specifically handling the output results) onto the MapReduce framework.
Is there an open-source distributed computing framework with which I can perform this task easily?
Thanks in advance.

You should use YARN to manage multiple clusters or resources.
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
Reference

It seems that you have already stored the data on each of the nodes, so you have already solved the "distributed storage" element of the problem.
Since each node's dataset is different, this isn't a parallel processing problem either.
It seems to me that you don't need Hadoop or any other big data framework. However, you can embrace the philosophy of Hadoop by taking the code to the data: run the clustering algorithm on each node, and then handle the results in whatever way you need. A caveat would be if you also have trouble loading the data and running the clustering algorithm on each node, but that is a different problem.
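As a rough illustration of that approach, here is a minimal sketch that launches the clustering program on each slave over SSH and copies the results back to the master. The hostnames, paths, and remote command are assumptions, not something from your setup.

```python
# Minimal "take the code to the data" sketch: run the clustering program on
# each slave over SSH, then copy the results back to the master.
# Hostnames, paths, and the remote command are made up for illustration.
import subprocess

slaves = ["slave1", "slave2", "slave3", "slave4"]

# Start all four clustering runs in parallel, one per slave.
procs = [
    subprocess.Popen(
        ["ssh", host, "python3 /opt/jobs/cluster_local.py /data/matrix.bin /tmp/result.json"]
    )
    for host in slaves
]
for p in procs:
    p.wait()

# Pull each node's result file back to the master for visualization.
for host in slaves:
    subprocess.run(["scp", f"{host}:/tmp/result.json", f"results/{host}.json"], check=True)
```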

Related

OrientDB in distributed architecture works with vertex replication across servers?

I have been working on a project with the OrientDB graph database. I managed to populate the database and run my queries on it without problems. But then I needed to run my queries using OrientDB's distributed feature, and I ran into an important (maybe trivial) doubt.
I managed to use the distributed mode without problems as well, using 3 different machines, but I wanted to be sure that OrientDB is really storing my database across the 3 machines I used. Is there any way to check that?
While researching this, I came to the conclusion that OrientDB replicates the entire database across all the machines; is that correct? My goal in using the distributed architecture was to improve performance, but if OrientDB works by replication and I run a query on one specific machine, will the query be processed using all the machines, or only one?
In short: when using distributed mode, does OrientDB distribute the vertices and edges across the machines and process queries using all of them?
I've read the entire documentation: http://orientdb.com/docs/2.0/orientdb.wiki/Distributed-Architecture.html and could not find a clear explanation for these questions.
Thanks in advance!
OrientDB, by default, replicates the entire DB on all the servers. What you're looking for is called "Sharding". OrientDB supports manual sharding (automatic in the future), which means you (the application) can decide where to store the vertices/edges.
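For a rough idea of what manual sharding looks like, here is a hedged sketch that issues OrientDB SQL through the pyorient Python client. The class, cluster, and credential names are made up, and the clusters would then be mapped to specific servers in your distributed configuration.

```python
# Hedged sketch of manual sharding with OrientDB SQL via pyorient.
# Class/cluster/credential names are assumptions, not from the question.
import pyorient

client = pyorient.OrientDB("localhost", 2424)
client.connect("root", "root_password")
client.db_open("mygraph", "admin", "admin")

# One cluster per shard, all belonging to the same class; each cluster can
# be assigned to a different server in the distributed configuration.
client.command("CREATE CLASS Client EXTENDS V")
client.command("ALTER CLASS Client ADDCLUSTER client_europe")
client.command("ALTER CLASS Client ADDCLUSTER client_usa")

# The application decides which shard (cluster) each vertex lands in.
client.command("INSERT INTO CLUSTER:client_europe SET @class = 'Client', name = 'Luca'")
client.command("INSERT INTO CLUSTER:client_usa SET @class = 'Client', name = 'Jay'")
```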

Using ElasticSearch as source of truth

I am working with a team which uses two data sources.
MSSQL as a primary data source for making transaction calls.
ES as a back-up/read-only source of truth for viewing the data.
e.g. if I place an order, the order is inserted into the DB, and then a RabbitMQ listener / batch job synchronizes the data from the DB to ES.
Somehow this system fails for even just a million records. By "fails" I mean the records are not updated in ES in a timely fashion. For example, say I create a coupon: the coupon is generated in the DB, and the customer tries to redeem it immediately, but ES doesn't have the information about the coupon yet, so the redemption fails. Of course there are options such as RabbitMQ's priority queues, but the questions I have are more basic.
I have a few questions which I asked the team and still haven't received satisfactory answers to:
What is the minimum load at which using Elasticsearch pays off, and doesn't it become overkill if we have just 1M records?
Does it really make sense to use ES as the source of truth for real-time data?
Is ES designed to handle relational-like data that gets continuously updated? AFAIK such search-optimized databases are write-once, read-many.
If we are doing it to handle load, then how is it different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?
The main question I have in mind is: how can we optimize this architecture so that we scale better?
PS:
When I asked about minimum load, what I really meant was: what is the number of records/transactions at which we can say ES will be faster than conventional relational databases? Or is there no such threshold at all?
What is the minimum load at which using Elasticsearch pays off, and doesn't it become overkill if we have just 1M records?
Answer: the load you can handle depends on the capabilities of your servers.
Does it really make sense to use ES as the source of truth for real-time data?
From ES website: "Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected."
So yes, it can be your source of truth. That said, it is "eventually consistent", which raises the question of how soon counts as "real-time"... and there is no way to answer that without testing and measuring your system.
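As an illustration of that near-real-time window, here is a hedged sketch using the elasticsearch-py client (8.x assumed; index name and fields are made up). refresh="wait_for" makes the write call block until the document is searchable, which narrows the gap in the redeem-the-coupon-immediately scenario, though it does not remove the queueing latency before ES is written to.

```python
# Hedged sketch with the elasticsearch-py client (8.x assumed).
# Index name and document fields are made up for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(
    index="coupons",
    id="COUPON-123",
    document={"code": "COUPON-123", "discount": 10, "active": True},
    refresh="wait_for",  # block until the next refresh makes the doc searchable
)

# A search issued right after the write can now already see the coupon.
hits = es.search(index="coupons", query={"term": {"code": "COUPON-123"}})
print(hits["hits"]["total"])
```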
Is ES designed to handle relational-like data that gets continuously updated? AFAIK such search-optimized databases are write-once, read-many.
That's a good point: like any eventually consistent system, it is indeed NOT optimized for a series of modifications!
If we are doing it to handle load, then how is it different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?
It won't. Bear in mind that ES, as quoted above, was built to accommodate the requirements of search and analysis. If that's not what you intend to do with it, you should consider another tool. Use the right tool for the right job.
1)
There isn't a minimum expected load.
You can have 2 small nodes (master & data) with 2 shards per index (1 primary + 1 replica).
You can also split your data into multiple indices if it makes sense from a functional point of view (i.e. how data is searched).
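As a rough sketch of such a small setup (elasticsearch-py client assumed, index name made up), an index with one primary shard and one replica looks like this:

```python
# Minimal sketch: one primary shard plus one replica, matching the
# two-small-nodes setup above. Index name is made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="orders",
    settings={"number_of_shards": 1, "number_of_replicas": 1},
)
```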
2)
In my experience, the main benefits you get from ElasticSearch are:
Near linear scalability.
Lucene-based text search.
Many ways to put your data to work: RESTful query API, Kibana...
Easy administration (compared to your typical RDBMS).
If your project doesn't get these benefits, then most probably ES is not the right tool for the job.
3)
ElasticSearch doesn't like data that is updated frequently. The best use case is for read-only data.
Anyway, this doesn't explain the high latency you are getting; your problem must lie in RabbitMQ or the network.
4)
Indeed, that's what I would do: MSSQL cluster for application data and ES for analytics.

What database is good enough for logging application?

I am writing a web application with Node.js that other applications can use to store logs and access them later through a web interface, or that applications can query themselves via an API. It is similar to Graylog2, but schema-free.
I've already tried CouchDB, in which each document would be a log entry, but since I'm not really using revisions it seems to me I'm not using all of its features. Besides that, I think that once the logs exceed a certain size they would be pretty hard to manage in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched, and capped, with the latest events easy to access. It should be schema-free, and writing to it should be non-blocking.
I'm considering Cassandra (I'm not really familiar with it) because of the points made here. MongoDB seems good too, since Graylog2 uses MongoDB, and there are some good points about it here.
I've already seen this question, but I'm not satisfied with the answers.
Edit:
For various reasons I can't use Cassandra in production, so now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want is that instead of having a single message field, the fields are defined by the client, which is why I want it to be schema-free; because of that, I may need to query on the user-defined fields. We could build it on SQL, but querying on the user-defined fields would be reinventing the wheel. The same goes for files.
Technically, what I'm looking for in the end is rich statistical data, easy debugging, and a lot of other things that we currently can't get out of the logs.
Where shall it be stored and how shall it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount of logs (terabytes or petabytes per day), then Apache Kafka, which is designed to allow data to be pulled into HDFS in parallel, is an interesting solution (still in the incubation stage at the time of writing). I believe that if you want to consume Kafka messages with MongoDB, you'd need to develop your own adapter to ingest them as a consumer of a particular Kafka topic. Although MongoDB data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process, so there may be a bottleneck or even race conditions depending on the rate and size of message traffic. Kafka is optimized to pump and append that data to HDFS nodes FAST, using message brokers. Once it is in HDFS you can map/reduce to analyze your information in a variety of ways.
If MongoDB can handle the ingestion load, then it is an excellent, scalable, real-time solution for finding information, particularly documents. Otherwise, if you have more time to process data (i.e. batch processes that take hours and sometimes days), then Hadoop or some other MapReduce database is warranted. Finally, Kafka can distribute the load of messages and hook that fire-hose up to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware, using software to manage failures and recover with a very low probability of losing data.
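For a feel of the producer side of that pipeline, here is a hedged sketch using the kafka-python client; the broker address and topic name are assumptions. Each log event is a free-form dict, which fits the schema-free requirement.

```python
# Hedged sketch of publishing schema-free log events to Kafka with the
# kafka-python client. Broker address and topic name are made up.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each event is an arbitrary dict, so clients can define their own fields.
producer.send("app-logs", {"app": "billing", "level": "error", "code": 502})
producer.flush()  # block until the broker has acknowledged the event
```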
Even with a small amount of data, MongoDB is a nice alternative to traditional relational database solutions, which require more developer overhead to design, build and maintain.
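On the MongoDB side, the "big array of logs that can be capped, with the latest events easy to access" requirement maps fairly directly onto a capped collection. A hedged pymongo sketch follows; the database and collection names and the size limit are made up.

```python
# Hedged sketch: a MongoDB capped collection for logs via pymongo.
# Database/collection names and the size limit are made up.
from pymongo import MongoClient, DESCENDING

db = MongoClient("mongodb://localhost:27017")["logging"]

# Capped collections preserve insertion order and silently drop the oldest
# documents once the size limit is reached.
if "logs" not in db.list_collection_names():
    db.create_collection("logs", capped=True, size=512 * 1024 * 1024)

db.logs.insert_one({"app": "billing", "level": "error", "msg": "timeout"})

# Last 10 events, newest first ($natural order follows insertion order).
for event in db.logs.find().sort("$natural", DESCENDING).limit(10):
    print(event)
```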
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly CouchDB is a domain-specific database. It can save you time if you need the following features:
Built-in web API: clients may query directly
Incremental map-reduce: CouchDB runs the job once, but you can query repeatedly at no cost. Updates to the data set are immediately reflected in the map/reduce result without full re-processing (see the sketch after this list).
Easy to start small but expand to large clusters without changing application code.
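To make the first two points concrete, here is a hedged sketch that talks to CouchDB's HTTP API from Python with requests. The database, design document, and view names are made up, and the map function is an ordinary CouchDB JavaScript view.

```python
# Hedged sketch of CouchDB's HTTP API and an incremental map-reduce view.
# Database/design/view names and credentials are made up.
import requests

couch = "http://admin:secret@localhost:5984"
requests.put(f"{couch}/logs")  # create the database (returns 412 if it already exists)

design = {
    "views": {
        "by_level": {
            "map": "function (doc) { emit(doc.level, 1); }",
            "reduce": "_count",
        }
    }
}
requests.put(f"{couch}/logs/_design/stats", json=design)

requests.post(f"{couch}/logs", json={"level": "error", "msg": "disk full"})

# The view is maintained incrementally: only new or changed docs are re-mapped.
counts = requests.get(
    f"{couch}/logs/_design/stats/_view/by_level", params={"group": "true"}
).json()
print(counts)
```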
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption.

Open source distributed computing/cloud computing frameworks

I was wondering if anyone knows of any good open source distributed computing projects? I have a computationally intensive program that could benefit from distributed computing (a la SETI@home, etc.) and want to know if anyone has seen such a thing, or whether I will be developing it from scratch.
I see that this is over a year old, but here is a new and relevant answer:
http://openstack.org/
Here's one for Java and one for C#, and here's an open source grid toolkit.
SETI@home uses BOINC.
MPAPI - Parallel and Distributed Applications Framework.
Sector 0 article: http://sector0.dk/?page_id=15. Gives a good overview of the framework, the architecture and the theory behind it.
Works on anything from a single machine up to 'n' machines.
Lets you design the distributed logic into the system.
Focuses on message passing to isolate the state that each thread has access to, i.e. no shared state, only messages.
Is open source =] and is Mono compatible, YAY!
Architecture in a Nutshell
Cluster
Single main node: controls the cluster.
Numerous sub-nodes (one per machine), which are the workhorses of the cluster.
Single registration server: binds the cluster together by allowing nodes to register/unregister with the cluster, notifying the existing nodes.
Communication
Node to node directly. Each worker communicates with others through its node. Messages are not propagated down through the remoting layer unless the two workers are on different nodes.
Hadoop if you want to run the machines yourself. Amazon Elastic MapReduce if you want to let others run your workers. Amazon Elastic MapReduce is based on Hadoop.
I have personally used BOINC, which is a robust, widely used solution that offers a great range of customization possibilities.
It is the most complete solution I know of. The only problems I had were that it was difficult to use for remote job submission (if you don't have access to the server) and it can take a while to set up. But overall it is a very good solution.
If you would rather implement distributed computing just over a local grid, you can use GridCompute, which should be quick to set up and will let you run your application through Python scripts.
PS: I am the developer of GridCompute.

Which embedded database capable of 100 million records has an efficient C or C++ API

I'm looking for a cross-platform database engine that can handle databases of up to hundreds of millions of records without severe degradation in query performance. It needs to have a C or C++ API which allows easy, fast construction of records and parsing of returned data.
Highly discouraged are products where data has to be translated to and from strings just to get it into the database. Technical users storing things like IP addresses don't want or need this overhead. This is a very important criterion, so if you're going to refer to products, please be explicit about how they offer such a direct API. Not wishing to be rude, but I can use Google - please assume I've found most mainstream products, and I'm asking because it's often hard to work out just what direct API they offer, rather than just a C wrapper around SQL.
It does not need to be an RDBMS - a simple ISAM record-oriented approach would be sufficient.
Whilst the primary need is for a single-user database, expansion to some kind of shared file or server operations is likely for future use.
Access to source code, either open source or via licensing, is highly desirable if the database comes from a small company. It must not be GPL or LGPL.
you might consider C-Tree by FairCom - tell 'em I sent you ;-)
I'm the author of hamsterdb.
Tokyo Cabinet and BerkeleyDB should work fine. hamsterdb definitely will. It has a plain C API, is open source, platform independent, very fast, and tested with databases up to several hundred GB and hundreds of millions of items.
If you are willing to evaluate it and need support, then drop me a mail (contact form on hamsterdb.com) - I will help as well as I can!
bye
Christoph
You didn't mention what platform you are on, but if Windows only is OK, take a look at the Extensible Storage Engine (previously known as Jet Blue), the embedded ISAM table engine included in Windows 2000 and later. It's used for Active Directory, Exchange, and other internal components, optimized for a small number of large tables.
It has a C interface and supports binary data types natively. It supports indexes, transactions and uses a log to ensure atomicity and durability. There is no query language; you have to work with the tables and indexes directly yourself.
ESE doesn't like to open files over a network, and doesn't support sharing a database through file sharing. You're going to be hard pressed to find any database engine that supports sharing through file sharing. The Access Jet database engine (AKA Jet Red, totally separate code base) is the only one I know of, and it's notorious for corrupting files over the network, especially if they're large (>100 MB).
Whatever engine you use, you'll most likely have to implement the shared usage functions yourself in your own network server process or use a discrete database engine.
For anyone finding this page a few years later, I'm now using LevelDB with some scaffolding on top to add the multiple indexing necessary. In particular, it's a nice fit for embedded databases on iOS. I ended up writing a book about it! (Getting Started with LevelDB, from Packt in late 2013).
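To give a flavor of that scaffolding, here is a hedged sketch of a secondary index layered on top of LevelDB, shown with the plyvel Python binding for brevity (the C++ API follows the same put/get/batch model). The key layout is a made-up convention, not part of LevelDB itself.

```python
# Hedged sketch: a secondary index on top of LevelDB via the plyvel binding.
# The "user:<id>" / "idx:email:<email>" key layout is a made-up convention.
import plyvel

db = plyvel.DB("/tmp/users.ldb", create_if_missing=True)

def put_user(user_id, email, blob):
    with db.write_batch() as wb:  # atomic batch write
        wb.put(("user:" + user_id).encode(), blob)                  # primary record
        wb.put(("idx:email:" + email).encode(), user_id.encode())   # index entry

def find_by_email(email):
    user_id = db.get(("idx:email:" + email).encode())
    return db.get(b"user:" + user_id) if user_id else None

put_user("42", "bob@example.com", b"\x01\x02\x03")  # binary payload, no string conversion
print(find_by_email("bob@example.com"))
```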
One option could be Firebird. It offers both a server based product, as well as an embedded product.
It is also open source and there are a large number of providers for all types of languages.
I believe what you are looking for is BerkeleyDB:
http://www.oracle.com/technology/products/berkeley-db/db/index.html
Never mind that it's Oracle, the license is free, and it's open-source -- the only catch is that if you redistribute your software that uses BerkeleyDB, you must make your source available as well -- or buy a license.
It does not provide SQL support, but rather direct lookups (via b-tree or hash-table structure, whichever makes more sense for your needs). It's extremely reliable, fast, ACID, has built-in replication support, and so on.
Here is a small quote from the page I refer to above, that lists a few features:
Data Storage
Berkeley DB stores data quickly and easily without the overhead found in other databases. Berkeley DB is a C library that runs in the same process as your application, avoiding the interprocess communication delays of using a remote database server. Shared caches keep the most active data in memory, avoiding costly disk access.
Local, in-process data storage
Schema-neutral, application native data format
Indexed and sequential retrieval (Btree, Queue, Recno, Hash)
Multiple processes per application and multiple threads per process
Fine grained and configurable locking for highly concurrent systems
Multi-version concurrency control (MVCC)
Support for secondary indexes
In-memory, on disk or both
Online Btree compaction
Online Btree disk space reclamation
Online abandoned lock removal
On disk data encryption (AES)
Records up to 4GB and tables up to 256TB
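To illustrate the direct-lookup style described above, here is a hedged sketch using the bsddb3 Python binding for brevity (the C API follows the same put/get model). The file name and key layout are made up.

```python
# Hedged sketch of Berkeley DB's direct key/value lookups via the bsddb3
# binding. File name and key layout are made up for illustration.
import bsddb3.db as bdb

db = bdb.DB()
db.open("events.bdb", dbtype=bdb.DB_BTREE, flags=bdb.DB_CREATE)

key = bytes([192, 168, 0, 1])   # raw 4-byte IP address, no string conversion
db.put(key, b"\x00\x2a")        # arbitrary binary payload

print(db.get(key))              # direct B-tree lookup, no SQL involved
db.close()
```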
Update: Just ran across this project and thought of the question you posted:
http://tokyocabinet.sourceforge.net/index.html. It is under the LGPL, so not compatible with your restrictions, but an interesting project to check out nonetheless.
SQLite would meet those criteria, except for the eventual shared-file scenario in the future (and actually it could probably do that too, if the network file system implements file locks correctly).
Many good solutions (such as SQLite) have been mentioned. Let me add two, since you don't require SQL:
HamsterDB: fast, simple to use, can store arbitrary binary data. No provision for shared databases.
GLib's HashTable module seems quite interesting too, and is very common, so you won't risk going into a dead end. On the other hand, I'm not sure there is an easy way to store the database on disk; it's mostly for in-memory stuff.
I've tested both on multi-million records projects.
As you are familiar with Fairtree, you are probably also familiar with Raima RDM.
It went open source a few years ago, then dbstar claimed that they had somehow acquired the copyright. This seems debatable though. From reading the original Raima license, this does not seem possible. Of course it is possible to stay with the original code release. It is rather rare, but I have a copy archived away.
SQLite tends to be the first option. It doesn't store data as strings, but I think you have to build an SQL command to do the insertion, and that command will involve some string building.
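For what it's worth, a minimal Python sqlite3 sketch (table and column names are made up): the command itself is SQL text, but the bound values stay native integers and BLOBs, so the data itself needs no string conversion.

```python
# Minimal sqlite3 sketch: the statement is SQL text, but parameter binding
# keeps the values as native integers/BLOBs. Table/column names are made up.
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (ip INTEGER, payload BLOB)")

ip = 0xC0A80001  # 192.168.0.1 packed into an integer
conn.execute(
    "INSERT INTO events (ip, payload) VALUES (?, ?)",
    (ip, b"\x01\x02\x03"),  # values are bound in binary form, not as strings
)
conn.commit()

print(conn.execute("SELECT payload FROM events WHERE ip = ?", (ip,)).fetchone())
conn.close()
```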
BerkeleyDB is a well-engineered product if you don't need a relational DB. I have no idea what Oracle charges for it or whether you would need a license for your application.
Personally, I would consider why you have some of your requirements. Have you done testing to verify the requirement that you need to do direct insertion into the database? It seems like you could take a couple of hours to write up a wrapper that converts from whatever API you want to SQL, and then see if SQLite, MySQL, ... meet your speed requirements.
There used to be a product called Btrieve, but I'm not sure if source code was included. I think it has been discontinued. The only database engine I know of with an ISAM orientation is c-tree.
