Is Hadoop recommended only for distributed environments? - database

I have a database whose size could go up to 1TB in a month. If I run a query directly, it takes a long time. So I was thinking of using Hadoop on top of the database - most of the time my query would involve searching the entire database. My database instances would number either 1 or 2, not more than that. After a while we purge the database.
So can we use the Hadoop framework, since it helps process large amounts of data?

Hadoop is not "something you query" but you can use it to process a large amount of data and create a search index which you then load into a system you can query.
You can also look into HBase if you want a store for big data. In addition to HBase there are a number of other key-value or non-relational (NoSQL) stores that work well with large data.
A proper answer depends on the kind of query you are running. Are you always running a specific query? If so, then a key-value store works well; just choose the right keys. If your query needs to search the entire database as you say, and you only make one query every hour or two, then yes, in principle, you could write a simple "query" in Hive that will read from your HDFS store.
Note that querying in Hive only saves you time versus an RDBMS or a simple grep when you have a lot of data and access to a decent-sized cluster. If you only have one machine, it's a non-solution.
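For concreteness, here is a minimal HiveQL sketch of what such a full-scan "query" could look like, assuming the data already sits in HDFS as delimited text; the table name, columns, and HDFS path are all illustrative:

    -- Hypothetical: expose existing delimited files in HDFS as a Hive table.
    -- Nothing is copied; Hive only records the schema and the location.
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
      event_id   BIGINT,
      event_time STRING,
      payload    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/events';          -- illustrative HDFS path

    -- A "search the entire database" style query; Hive compiles this into
    -- map-reduce jobs that scan every file under /data/events in parallel.
    SELECT event_id, event_time
    FROM events
    WHERE payload LIKE '%error%';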

Hadoop works better on a distributed system. Moreover, 1TB is not big data; for this, your relational database will do the job.
The real power of Hadoop comes when you have to process 100 TB or more of data, where relational databases fail.
If you look into HBase, it is fast, but it is not a substitute for your MySQL or Oracle.

Related

Data archiving solutions

I have an application set up using MySQL at the backend with about 130 tables; the total size is currently more than 30-40 GB and growing fast.
Our DB is well optimized, but we believe that due to the size of the database, performance is taking a hit.
I need to implement a process to archive data. After a little reading, I learned that I could push all archivable data to Hadoop. What I need to know is: is there any way I can directly hit Hadoop to retrieve data from my backend (CodeIgniter, CakePHP, Django, etc.)? Thanks
I think you could try Apache Sqoop: http://sqoop.apache.org/
Sqoop 1 was originally designed for moving data from relational databases to Hadoop. Sqoop 2 is more ambitious and aims to move data between any two sources.

Need Suggestions: Utilizing columnar database

I am working on a highly performance-sensitive dashboard project where the results are mostly aggregated data mixed with non-aggregated data. The first page is loaded by 8 different complex queries returning mixed data. The dashboard is served by a centralized database (Oracle 11g) which receives data from many systems in real time (using a replication tool). The data shown is produced by very complex queries (multiple joins, counts, group-bys, and many WHERE conditions).
The issue is that as the data grows, the DB queries take more time than defined/agreed. I am thinking of moving the aggregation functionality (all the counts) to a columnar database, say HBase, while the rest of the linear data will be fetched from Oracle. Both data sets will be merged on a key in the app layer. I need expert opinions on whether this is the correct approach.
There are a few things which are not clear to me:
1. Will Sqoop be able to load data based on a query/view, or only tables? On a continuous basis or one time?
2. If a record is modified (e.g. its status changes), how will HBase get to know?
My two cents: HBase is a NoSQL database built for fast lookup queries, not for aggregated, ad-hoc queries.
If you are planning to use a Hadoop cluster, you can try Hive with the Parquet storage format. If you need near-real-time queries, you can go with an MPP database. Commercial options are Vertica or maybe Redshift from Amazon; for an open-source solution, you can use Infobright.
These columnar options are going to give you great aggregate query performance.
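As a rough sketch of the Hive-with-Parquet idea (the table, the columns, and the staging_events source table are invented for the example, and STORED AS PARQUET assumes a reasonably recent Hive version):

    -- Hypothetical: keep the dashboard counts in a Parquet-backed Hive table,
    -- so aggregate scans only read the columns they need.
    CREATE TABLE dashboard_counts (
      region    STRING,
      status    STRING,
      event_day STRING,
      cnt       BIGINT
    )
    STORED AS PARQUET;

    -- Recompute the aggregates in bulk from a (hypothetical) staging table
    -- that Sqoop or another loader fills from the source systems.
    INSERT OVERWRITE TABLE dashboard_counts
    SELECT region, status, to_date(event_time), COUNT(*)
    FROM staging_events
    GROUP BY region, status, to_date(event_time);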

Column-Oriented Databases for Financial Data Analysis

I currently have a lot of financial data I would like to analyze and compute on. I have built a data system that reads from flat files and does some decently intelligent caching to maintain the performance I want, but I am starting to have too much data for this system...
I am currently thinking about using PostgreSQL with a schema sort of like this:
Table: Things
Fields: T_id, Row, Sub-Row, Column, Resolution, Readable-Name, Meta
Table: Data
Fields: d_id, T_id, timestamp, value
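Sketched in PostgreSQL DDL, that would be something like this (the column types are only placeholders, hyphens in names become underscores, and "row"/"column"/"timestamp" are quoted because they are keywords):

    CREATE TABLE things (
        t_id          serial PRIMARY KEY,
        "row"         integer,
        sub_row       integer,
        "column"      integer,
        resolution    integer,
        readable_name text,
        meta          text
    );

    CREATE TABLE data (
        d_id        bigserial PRIMARY KEY,
        t_id        integer REFERENCES things (t_id),
        "timestamp" timestamptz,
        value       double precision
    );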
I was wondering if PostgreSQL would be performant with the above schema if my Data table has billions of rows.
Another idea I had was using a column-oriented database, but I can't seem to find any good open-source ones to get started with. Cassandra is really not made for this situation, as I will be reading much, much more than writing.
It depends on your expectations. PostgreSQL can probably process these queries on your schema, but a query can take minutes or hours, depending on the number of rows processed, whereas column-store databases can be about 10 times faster. PostgreSQL is a relational OLTP database, your schema is not well normalized, and you probably want OLAP.
There are some open-source column-store databases like MonetDB or LucidDB, but they are not from PostgreSQL's space; the only one in that space is the commercial database Vertica. You can also look at MySQL column-store engines: http://www.mysqlperformanceblog.com/2010/08/16/testing-mysql-column-stores/
The answer depends on your budget.
Here is a list of solutions which we use in practice (from cheap to expensive):
MongoDB
PostgreSQL
InfiniDB
kdb+

Hadoop and database

I am currently looking at an issue where I am trying to integrate Hadoop with a database, since Hadoop offers parallelism but not performance. I was referring to the HadoopDB paper. Hadoop usually takes a file, splits it into chunks, and places these chunks on different data nodes. During processing, the namenode tells where a chunk can be found, and a map is run on that node. I am looking at the possibility of the user telling the namenode which datanode to run the map on, with the namenode either running the map to get the data from a file or from a database. Can you kindly tell me whether it is feasible to tell the namenode which datanode to run the map on?
Thanks!
Not sure why you would like to tie a map/reduce task to a particular node. What happens if that particular node goes down? In Hadoop, map/reduce operations cannot be tied to a particular node in the cluster; that is what makes Hadoop scalable.
Also, you might want to take a look at Apache Sqoop for importing/exporting between Hadoop and a database.
If you are looking to query data from a distributed data store, why don't you consider storing your data in HBase, which is a distributed database built on top of Hadoop and HDFS? It stores data in HDFS in the background and gives query semantics like a big database. In that case, you don't have to worry about issuing queries to the right data node; the query semantics of HBase (also known as the Hadoop database) take care of that.
For easy querying and storing of data in HBase, and if your data is time-series data, you can also consider using OpenTSDB, which is a wrapper around HBase that provides easy tag-based query semantics and integrates nicely with GNUPlot to give you graph-like visualizations of your data.
HBase is very well suited for random reads/writes to a very large distributed data store. However, if your queries operate on bulk reads/writes, Hive may be a well-suited solution for your case. Similar to HBase, it is also built on top of Hadoop MapReduce and HDFS, and it converts each query to underlying map-reduce jobs. The best thing about Hive is that it provides SQL-like semantics, and you can query just like you would on a relational database.
As far as the organization of data and a basic introduction to the features of Hive are concerned, you may like to go through the following points:
Hive adds structure to data stored on HDFS. The schema of its tables is stored in a separate metadata store. It converts SQL-like queries into multiple map-reduce jobs running over HDFS in the background.
Traditional databases follow a schema-on-write policy: once a schema is designed for a table, the data is checked at write time for conformance to the predefined schema. If it does not conform, the write is rejected.
Hive is the opposite: it uses a schema-on-read policy. Both policies have their own trade-offs. With schema on write, loads are slower because schema conformance is verified at load time; in return, queries are faster because data can be indexed on the predefined columns of the schema. However, there may be cases where the indexing cannot be specified while initially populating the data, and this is where schema on read comes in handy: it provides the option of having two different schemas over the same underlying data, depending on the kind of analysis required.
Hive is well suited for bulk access and bulk updates of data, as an update requires a completely new table to be constructed. Also, query time is slower compared to traditional databases because of the absence of indexing.
Hive stores the metadata in a relational database called the "Metastore".
There are two kinds of tables in Hive (a small sketch follows this list):
Managed tables - The data file for the table is moved into the Hive warehouse directory on HDFS (in general, or on any other Hadoop filesystem). When the table is dropped, both the metadata and the data are deleted from the filesystem.
External tables - Here you can populate the table lazily. No data is moved to the Hive warehouse directory, and the schema/metadata is loosely coupled to the actual data. When the table is dropped, only the metadata gets deleted and the actual data is left untouched. This is helpful when you want the data to be used by multiple databases, or when you need multiple schemas over the same underlying data.
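For example, in HiveQL (table names, columns, and HDFS paths are illustrative):

    -- Managed table: Hive owns the data. On LOAD, the file is moved into the
    -- Hive warehouse directory; DROP TABLE removes both metadata and data.
    CREATE TABLE logs_managed (
      id  BIGINT,
      msg STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA INPATH '/incoming/logs.csv' INTO TABLE logs_managed;

    -- External table: Hive only records the schema and the location.
    -- DROP TABLE removes the metadata; the files under /data/logs stay put.
    CREATE EXTERNAL TABLE logs_external (
      id  BIGINT,
      msg STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/logs';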

postgresql fast transfer of a table between databases

I have a PostgreSQL operational DB with data partitioned per day, and a PostgreSQL data warehouse DB.
In order to move the data quickly from the operational DB to the DWH, I would like to copy the tables as fast as possible and with the least resource usage.
Since the tables are partitioned by day, I understand that each partition is a table in itself.
Does that mean I can somehow copy the data files between the machines and create the tables in the DWH from those data files?
What is the best practice in that case?
EDIT:
I will answer all the questions asked here:
1. I'm building an ETL. The first step of the ETL is to copy the data with as little impact on the operational DB as possible.
2. I would want to replicate the data if this won't slow down writes on the operational DB.
3. A bit more detail: the operational DB is not my responsibility, but the main concern is the write time on that DB.
It writes about 500 million rows a day; some hours are more loaded than others, but there are no hours without writes at all.
4. I came across a few tools/approaches - replication, pg_dump - but I couldn't find anything that compares the tools, to know when to use what and to understand what fits my case.
If you are doing a bulk transfer, I would actually consider running pg_dump on the warehouse system and piping the results into psql once a day. You could probably run Slony too, but that would require more resources and would probably be more complicated.
There are many good ways to replicate data between databases. Since you are just looking for a "fast transfer of a table between databases", a simple and fast solution is provided by the extension dblink. There are many examples here on SO. Try a search.
If you want a wider approach, continued synchronization, etc., consider one of the established tools for replication. There is a nice comparison in the manual to get you started.
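As a rough illustration of the dblink route, a minimal sketch run on the data warehouse side; the connection string, table names, and column types are all made up, and the target table is assumed to already exist with matching columns:

    -- Install the extension on the DWH database (once).
    CREATE EXTENSION IF NOT EXISTS dblink;

    -- Pull one daily partition from the operational DB over a dblink connection.
    INSERT INTO dwh_events_2016_01_01
    SELECT *
    FROM dblink(
           'host=operational-db dbname=ops user=etl password=secret',
           'SELECT * FROM events_2016_01_01'
         ) AS t(event_id bigint, event_time timestamptz, payload text);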
