Using database with Hadoop cluster - database

Currently I have a small Hadoop cluster that performs a MapReduce task on my input data and generates some output. What I would like to do is store this data in a database so that it can be queried for analysis. I would like the database to provide ACID-like properties, so that any change on one node is reflected across the entire cluster; then if a node fails, other nodes will still hold the current data.
I have been researching things like Hive with its ACID transactions, but is this all I would need to accomplish that?
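For reference, from what I have read so far, a transactional table in Hive would be declared roughly like the sketch below (assuming Hive 0.14+ with ORC storage and the transaction manager enabled; I have not actually tried this yet, and the connection details are placeholders):

    # Untested sketch of a Hive ACID table, based on my reading of the docs.
    # Assumes PyHive talking to HiveServer2; host/port/table names are placeholders.
    from pyhive import hive

    conn = hive.Connection(host='hive-server', port=10000)
    cursor = conn.cursor()

    # ACID tables must be stored as ORC and flagged as transactional
    # (and bucketed on pre-3.x Hive versions).
    cursor.execute("""
        CREATE TABLE mapreduce_results (
            record_key   STRING,
            record_value BIGINT
        )
        CLUSTERED BY (record_key) INTO 4 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('transactional'='true')
    """)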

Related

Loading data from SQL Server to Elasticsearch

Looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and as a result, there is a 24 hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data during the weekend and then replicate writes in real time during weekdays.
Thanks.
ElasticSearch's main use case is providing search-type capabilities on top of large amounts of unstructured, text-based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool for parsing out pieces of those emails based on rules you set up, enabling search (and to some degree querying) of those email messages.
If your data is already in SQL Server, it sounds like it is already structured, so there is not much to be gained from ElasticSearch in terms of reporting and availability. Rather, you would likely be introducing extra complexity into your data workflow.
If you have structured data in SQL Server already and you are experiencing issues with reporting directly off of it, you should look at building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features you could look into to accomplish this are AlwaysOn Availability Groups, Replication, and SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) has different pros and cons. For example, AlwaysOn Availability Groups are very easy to set up and offer the ability to automatically fail over if your main server has an outage, but they clone the entire database to a replica. Replication lets you choose more granularly to copy only specific tables and views, but then you cannot fail over as easily if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
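As a purely illustrative sketch (the table, columns and connection string are made up), adding such an index is a single statement, shown here issued from Python with pyodbc:

    # Hypothetical example: add a nonclustered columnstore index to speed up
    # analytic scans on a reporting table, without touching the row store.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=reporting-server;DATABASE=SalesDb;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()
    cursor.execute("""
        CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Sales_Reporting
        ON dbo.Sales (OrderDate, CustomerId, Amount);
    """)
    conn.commit()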
I've been down both paths, implementing ElasticSearch and implementing a data warehouse using all three of the main data synchronization features above, for structured data and for unstructured large text data, and have seen the proper use cases for both. One data warehouse I've managed in the past had tables with billions of rows (each table terabytes in size), and it was highly performant for reporting on fairly modest hardware in AWS (we weren't even using Redshift).

Eventually consistent document-store database similar to Cassandra

I'm looking for an open source data store that scales as easily as Cassandra but whose data can be queried as documents, like MongoDB.
Are there currently any databases out that do this?
On the website http://nosql-database.org you can find a list of many NoSQL databases sorted by datastore type; you should check the Document stores section there.
I'm not naming any specific database to avoid a biased/opinion-based answer, but if you are interested in a data store that is as scalable as Cassandra, you probably want to check those that use a master-master/multi-master/masterless architecture (call it what you like, the idea is the same), where both writes and reads can be split among all nodes in the cluster.
I know Cassandra is optimized for writes rather than reads, but without further details in the question I can't refine the answer any further.
Update:
Disclaimer: I haven't used CouchDB at all, and haven't tested its performance either.
Since you brought up CouchDB, I'll add what I've found in the official documentation, in the section on distributed databases and replication:
CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected. Those changes can then be replicated bi-directionally later.
The CouchDB document storage, view and security models are designed to work together to make true bi-directional replication efficient and reliable. Both documents and designs can replicate, allowing full database applications (including application design, logic and data) to be replicated to laptops for offline use, or replicated to servers in remote offices where slow or unreliable connections make sharing data difficult.
The replication process is incremental. At the database level, replication only examines documents updated since the last replication. Then for each updated document, only fields and blobs that have changed are replicated across the network. If replication fails at any step, due to network problems or crash for example, the next replication restarts at the same document where it left off.
Partial replicas can be created and maintained. Replication can be filtered by a javascript function, so that only particular documents or those meeting specific criteria are replicated. This can allow users to take subsets of a large shared database application offline for their own use, while maintaining normal interaction with the application and that subset of data.
This looks quite scalable to me, as it seems you can add new nodes to the cluster and all the data gets replicated to them.
Partial replicas also seem an interesting option for really big data sets, though I'd configure them very carefully to avoid situations where a given query might not return valid results, for example during a network partition when only a partial replica is reachable.
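To make the filtered-replication idea concrete, here is a rough sketch (untested on my side; hosts, database names and the filter name are placeholders) of triggering a filtered replication through CouchDB's HTTP _replicate endpoint:

    # Untested sketch: ask CouchDB to replicate only the documents accepted by a
    # design-document filter function, via the documented _replicate endpoint.
    import requests

    requests.post(
        'http://couch-node-1:5984/_replicate',
        json={
            'source': 'http://couch-node-1:5984/appdata',
            'target': 'http://couch-node-2:5984/appdata',
            'filter': 'replication/by_region',   # hypothetical filter in a design doc
            'query_params': {'region': 'emea'},  # passed to the filter function
            'continuous': True,                  # keep replicating as changes arrive
        },
    )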

Hadoop and database

I am currently looking at an issue where I am trying to integrate Hadoop with a database, since Hadoop offers parallelism but not performance. I was referring to the HadoopDB paper. Hadoop usually takes a file, splits it into chunks and places those chunks on different data nodes. During processing, the namenode tells where a chunk can be found and a map is run on that node. I am looking at the possibility of the user telling the namenode which datanode to run the map on, with the namenode running the map to get the data either from a file or from a database. Can you tell me whether it is feasible to tell the namenode which datanode to run the map on?
Thanks!
I am not sure why you would like to tie a map/reduce task to a particular node. What happens if that particular node goes down? In Hadoop, map/reduce operations cannot be tied to a particular node in the cluster; that is what makes Hadoop scalable.
Also, you might want to take a look at Apache Sqoop for importing/exporting data between Hadoop and a database.
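For illustration only (the table name, credentials and JDBC URL are placeholders), a single table can be pulled from a relational database into HDFS with a Sqoop import along these lines, here launched from Python:

    # Sketch: import one MySQL table into HDFS with Sqoop (assumes Sqoop and the
    # JDBC driver are installed on this machine; all names are placeholders).
    import subprocess

    subprocess.run([
        'sqoop', 'import',
        '--connect', 'jdbc:mysql://db-host/mydb',
        '--username', 'etl_user', '--password-file', '/user/etl/.pw',
        '--table', 'orders',
        '--target-dir', '/data/orders',
        '--num-mappers', '4',   # parallel map tasks doing the copy
    ], check=True)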
If you are looking to query data from a distributed data store, then why don't you consider storing your data in HBase, which is a distributed database built on top of Hadoop and HDFS? It stores data in HDFS behind the scenes and gives you query semantics like a big database, so you don't have to worry about issuing queries to the right data node; HBase (also known as the Hadoop database) takes care of that for you.
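As a small illustration (assuming the HBase Thrift server is running and using the happybase client; table, host and column names are made up), random reads and writes look like this:

    # Sketch of HBase-style access: no MapReduce job, just keyed reads/writes.
    import happybase

    connection = happybase.Connection('hbase-thrift-host', port=9090)
    table = connection.table('events')   # assumes the table already exists

    # Write one row into column family 'cf'.
    table.put(b'row-00042', {b'cf:status': b'ok', b'cf:payload': b'...'})

    # Read it straight back by key, regardless of which region server holds it.
    row = table.row(b'row-00042')
    print(row[b'cf:status'])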
If your data is time-series data, you can also consider OpenTSDB, a wrapper around HBase that makes storing and querying easier: it provides simple tag-based query semantics and integrates nicely with GNUPlot to give you graph-like visualizations of your data.
HBase is very well suited for random reads/writes to a very large distributed data store. However, if your queries operate on bulk reads/writes, Hive may be a better fit for your case. Like HBase, it is built on top of Hadoop MapReduce and HDFS, and it converts each query into underlying map-reduce jobs. The best thing about Hive is that it provides SQL-like semantics, so you can query it much like you would a relational database.
As far as the organization of data and a basic introduction to Hive's features are concerned, you may like to go through the following points:
Hive adds structure to the data stored on HDFS. The table schemas are kept in a separate metadata store, and SQL-like queries are converted into multiple map-reduce jobs that run on HDFS in the background.
Traditional databases follow a schema-on-write policy: once a schema is designed for a table, every write is checked against that pre-defined schema, and if the data does not conform, the write is rejected.
Hive is the opposite: it uses a schema-on-read policy. Both policies have their own trade-offs. With schema on write, load times are longer because schema conformance is verified while loading the data, but queries are faster because the data can be indexed on the predefined columns. However, there are cases where the right indexing cannot be decided when the data is first loaded, and this is where schema on read comes in handy: it lets you have two different schemas over the same underlying data, depending on the kind of analysis required.
Hive is well suited for bulk access; updating data is expensive, as an update requires a completely new table to be constructed. Query time is also slower compared to traditional databases because of the absence of indexing.
Hive stores the metadata into a relational database called the “Metastore”.
There are 2 kinds of tables in Hive:
Managed tables - The data file for the table is moved into the Hive warehouse directory on HDFS (or any other Hadoop filesystem). When the table is dropped, both the metadata and the data are deleted from the filesystem.
External tables - Here the table can be populated lazily. No data is moved into the Hive warehouse directory, and the schema/metadata is only loosely coupled to the actual data. When the table is dropped, only the metadata gets deleted and the actual data is left untouched. This is helpful when you want the data to be used by multiple systems, or when you need multiple schemas over the same underlying data.
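A minimal sketch of both kinds of table, assuming PyHive against HiveServer2 (all names and paths are placeholders):

    from pyhive import hive

    conn = hive.Connection(host='hive-server', port=10000)
    cursor = conn.cursor()

    # Managed table: Hive owns the files; DROP TABLE removes data and metadata.
    cursor.execute("""
        CREATE TABLE logs_managed (ts BIGINT, level STRING, msg STRING)
        STORED AS ORC
    """)

    # External table: only the schema lives in the metastore; the files under
    # LOCATION survive a DROP TABLE and can be shared with other tools.
    cursor.execute("""
        CREATE EXTERNAL TABLE logs_external (ts BIGINT, level STRING, msg STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION '/data/raw/logs'
    """)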

Using Hadoop to perform DML operations on large fixed-format files

We have a product that uses a MySQL database as the data store. The data store holds a large amount of data. The problem we are facing is that the response time of the application is very slow. The database queries are very basic, with very simple joins, if any. According to some senior employees, the root cause of the slow response time is the database operations on the huge data store.
Another team in our company worked on a project in the past where they processed large fixed-format files using Hadoop and dumped the contents of these files into database tables. Borrowing from this project, some of the team members feel that we can migrate from a MySQL database to simple fixed-format files that will hold the data, with one file corresponding to each table in the database. We can then build another data interaction layer that provides interfaces for performing DML operations on the contents of these files. This layer will be developed using Hadoop and the MapReduce programming model.
At this point, several questions come to my mind.
1. Does the problem statement fit into the kind of problems that are solved using Hadoop?
2. How will the application ask the data interaction layer to fetch/update/delete the required data? As far as my understanding goes, the files containing the data will reside on HDFS. We will spawn a Hadoop job that will process the required file (similar to a table in the db) and fetch the required data. This data will be written to an output file on HDFS, which we will then have to parse to get the required content.
3. Will the approach of using fixed format-files and processing them with Hadoop truly solve the problem?
I have managed to set up a simple cluster with two Ubuntu machines, but after playing around with Hadoop for a while, I feel that the problem statement is not a good fit for Hadoop. I could be completely wrong and therefore want to know whether Hadoop fits this scenario, or whether it is just a waste of time because the problem statement is not in line with what Hadoop is meant for.
I would suggest going straight to Hive (http://hive.apache.org/). It is a SQL engine / data warehouse built on top of Hadoop MapReduce.
In a nutshell, it gets Hadoop's scalability and Hadoop's high latency.
I would consider storing the bulk of the data there, doing all the required transformations, and moving only summarized data to MySQL to serve queries. It is usually not a good idea to translate user requests into Hive queries directly: they are too slow, and running jobs in parallel is not trivial.
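A rough sketch of that pattern (all hosts, credentials, tables and columns are placeholders; assumes PyHive and PyMySQL):

    from pyhive import hive
    import pymysql

    # The heavy aggregation runs in Hive, i.e. as MapReduce jobs over HDFS...
    hive_cur = hive.Connection(host='hive-server', port=10000).cursor()
    hive_cur.execute("""
        SELECT customer_id, SUM(amount) AS total
        FROM sales_raw
        GROUP BY customer_id
    """)
    summary = hive_cur.fetchall()

    # ...and only the small summarized result is pushed to MySQL to serve queries.
    mysql_conn = pymysql.connect(host='mysql-host', user='app',
                                 password='...', database='reporting')
    with mysql_conn.cursor() as cur:
        cur.executemany(
            "REPLACE INTO customer_totals (customer_id, total) VALUES (%s, %s)",
            summary,
        )
    mysql_conn.commit()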
If you are planning to update data often, then storing it directly in Hadoop may not be a good option for you. To update a file in Hadoop you may have to rewrite the file, delete the old one and copy the new file into HDFS.
However, if you are mostly just searching and joining the data, then it is a good option. If you use Hive, you can write SQL-like queries.
In Hadoop your workflow could be something like the one described below:
You run a Hadoop job for your query.
Your Hadoop program parses the query and executes jobs that read and join files based on the query and its input parameters.
The output is generated in HDFS.
You copy the output to the local file system and present it to your program.
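For example, a minimal "query" can be written as a map-only Hadoop Streaming job; the sketch below keeps only the rows whose third tab-separated field equals 'ERROR' (the file layout, column position and paths are assumptions about your data):

    #!/usr/bin/env python
    # mapper.py -- map-only Hadoop Streaming job acting as a simple filter query.
    # Submitted with something like:
    #   hadoop jar hadoop-streaming.jar -input /data/input -output /data/output \
    #       -mapper mapper.py -file mapper.py -numReduceTasks 0
    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        # Keep only the rows matching the "query"; everything else is dropped.
        if len(fields) > 2 and fields[2] == 'ERROR':
            print(line.rstrip('\n'))

The matching rows end up as part-* files under the job's output directory on HDFS, which is what you would then copy back to the local file system.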

Hadoop - is it recommended only for distributed environments?

I have a database whose size could grow to 1 TB in a month. If I run a query directly, it takes a long time. So I was thinking of using Hadoop on top of the database - most of the time my queries would involve searching the entire database. I would have only 1 or 2 database instances, not more than that. After a while we purge the database.
So can we use the Hadoop framework, since it helps with processing large amounts of data?
Hadoop is not "something you query" but you can use it to process a large amount of data and create a search index which you then load into a system you can query.
You can also look into HBase if you want a store for big data. In addition to HBase there are a number of other key-value or non-relational (NoSQL) stores that work well with large data.
A proper answer depends on the kind of query you are running. Are you always running a specific query? If so, then a key-value store works well; just choose the right keys. If your query needs to search the entire database as you say, and you only make one query every hour or two, then yes, in principle, you could write a simple "query" in Hive that will read from your HDFS store.
Note that querying in Hive only saves you time versus an RDBMS or a simple grep when you have a lot of data and access to a decent-sized cluster. If you only have one machine, it's a non-solution.
Hadoop works better on a distributed system. Moreover, 1 TB is not big data; for this, your relational database will do the job.
The real power of Hadoop comes when you have to process 100 TB or more of data, where relational databases fall short.
If you look into HBase, it is fast, but it is not a substitute for MySQL or Oracle.
