Data archiving solutions - database

I have an application with MySQL at the backend and about 130 tables; the total size is currently 30-40 GB and growing fast.
Our DB is well optimized, but we believe the performance is taking a hit due to the size of the database.
I need to implement a process to archive data. After a little reading I found that I could push all archivable data to Hadoop. What I need to know is: is there any way I can directly hit Hadoop to retrieve data from my backend (CodeIgniter, CakePHP, Django, etc.)? Thanks

I think you could try Apache Sqoop: http://sqoop.apache.org/
Sqoop 1 was originally designed for moving data from relational databases to Hadoop. Sqoop 2 is more ambitious and aims to move data between any two sources.
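As a rough sketch of what a Sqoop-driven archive step could look like when driven from a Python backend: the connection string, credentials, table and column names below are all made up, and it assumes the sqoop client is installed on the machine running the script.

    import subprocess

    # Hypothetical nightly archive job: copy rows older than a cutoff date from
    # MySQL into HDFS with Sqoop. All connection details and names are examples.
    def archive_table(table, cutoff_date):
        subprocess.check_call([
            "sqoop", "import",
            "--connect", "jdbc:mysql://db-host/myapp",
            "--username", "archiver",
            "--password-file", "/user/archiver/.mysql-pass",  # keeps the password off the command line
            "--table", table,
            "--where", "created_at < '%s'" % cutoff_date,
            "--target-dir", "/archive/%s/%s" % (table, cutoff_date),
        ])

    if __name__ == "__main__":
        archive_table("orders", "2014-01-01")

For the "retrieve it from my backend" half of the question, the answers to the other questions below cover Hive and Impala, which expose data stored in Hadoop through SQL-like interfaces that PHP or Python applications can query.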

Related

Best approach for storing images

I am working on a .NET web application. The application has a separate database (SQL Server 2012) for storing and retrieving images. We have functionality where multiple scanned documents are attached to each process and stored in this database. The document DB grows by about 9 GB per day. I am stuck on how to handle such a huge document DB and am looking for answers to the following questions:
Should we use SQL Server, MongoDB, or some other database for this scenario?
What should we do to better store these daily increasing images? Should we use a partitioning technique, an archiving technique, or a file-system management technique?
What are the best practices for managing such huge image databases?
How do we control its size limit; should we apply a compression technique?
The system is mainly used for inserting documents, and these documents are accessed rarely.
How can we optimize its storage?

Need a solution to get rid of multiple databases

In my company we have a multiple-database structure hosted in SQL Server.
For example, whenever a new customer signs up with us, we create a new DB in SQL Server to maintain their data.
Right now we already have 2,000+ DBs on our database server. We expect more customers to sign up in the near future, which might push the count past 5,000.
Having 5,000+ DBs, with the count still growing, is probably not advisable. We sometimes run tasks across all the DBs, and if we run them across 5,000+ DBs we will surely end up with performance issues.
What would be an alternative that avoids creating a separate DB for each and every customer while still keeping their data separate?
I keep hearing about Big Data and other database solutions but cannot get a clear picture.
Can someone shed some light on this?
If the databases have an identical schema you could combine them into one. That way each customer's data becomes a set of rows in the new database; a new customer will probably just be a few new rows in the tables that store customer profiles.
You can use row-level security to restrict access to each customer's data:
https://msdn.microsoft.com/en-us/library/dn765131.aspx
For the pros and cons of this approach versus your existing setup, see: Pros/Cons Using multiple databases vs using single database and Single or multiple databases.
Other options would provide a great learning opportunity, but they may carry a significant transition cost even if some of them were indeed better.
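To make the row-level-security suggestion concrete, here is a minimal sketch, assuming SQL Server 2016+ or Azure SQL Database (which is what the linked feature requires), a combined database with a hypothetical dbo.Orders table carrying a CustomerId column, and the pyodbc client; all names and the connection string are made up.

    import pyodbc

    # Placeholder connection string; point it at the combined database.
    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=sql-host;DATABASE=Combined;UID=app;PWD=secret")
    cur = conn.cursor()

    # Predicate function: a row is visible only when its CustomerId matches the
    # value the application stored in SESSION_CONTEXT after authenticating.
    cur.execute("""
    CREATE FUNCTION dbo.fn_customer_filter(@CustomerId int)
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN SELECT 1 AS allowed
           WHERE @CustomerId = CAST(SESSION_CONTEXT(N'CustomerId') AS int);
    """)

    # Attach the predicate to the shared table as a security policy.
    cur.execute("""
    CREATE SECURITY POLICY dbo.CustomerIsolation
    ADD FILTER PREDICATE dbo.fn_customer_filter(CustomerId) ON dbo.Orders
    WITH (STATE = ON);
    """)

    # Per connection, the application sets the current customer once...
    cur.execute("EXEC sp_set_session_context @key = N'CustomerId', @value = ?;", 42)
    # ...and ordinary queries are then transparently filtered to that customer.
    cur.execute("SELECT COUNT(*) FROM dbo.Orders;")
    print(cur.fetchone()[0])
    conn.commit()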
One solution I would suggest is to use a prefix on the table names for each customer. You can then handle the security issue by limiting each customer to their own set of tables.
The con is that you will have to rewrite your application to apply the prefix to every table it accesses. If you have a lot of tables, that will be a problem.
I think this is how some multi-site WordPress hosts handle the database issue.
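A tiny sketch of what the data-access layer would do under this scheme; the naming convention is invented.

    # Hypothetical helper for the prefix-per-customer approach: all of customer
    # 42's tables live in one database but are named like "c42_orders".
    def customer_table(customer_id: int, base_name: str) -> str:
        return "c%d_%s" % (customer_id, base_name)

    # The data-access layer then builds its queries against the prefixed name, e.g.
    # "SELECT * FROM %s WHERE status = 'open'" % customer_table(42, "orders")

Database permissions can then be granted to each customer's login on just that customer's prefixed tables.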
You should consider whether you just store the data and access it with simple queries, or whether you usually run complex queries. If you just store the data, access it with simple queries, and your needs are not 100% relational, maybe you should consider moving part of your data to the HDFS file system:
https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS .
There are many tools for processing data in Hadoop, but the rising one for sure is Spark:
https://en.wikipedia.org/wiki/Apache_Spark
Probably the best solution is to start by moving your historic data into HDFS just for storage, and keep the rest as it is until you are confident with the Hadoop and Spark paradigm.
Hadoop provides a distributed, fault-tolerant file system (HDFS), and Spark is an engine for batch processing huge amounts of unstructured or structured data. Consider that data in Hadoop is usually not structured, so you have to change the way you process it. If you still want to use SQL, I suggest checking out Impala and Hive as well:
http://impala.io/
https://hive.apache.org/
Take a look at the Cloudera web site for a more structured solution, instead of a lot of single tools that you would need to organize yourself:
http://www.cloudera.com/content/www/en-us/solutions.html
They have a quick-start VM for trying all the Hadoop ecosystem tools; that is probably the best way to start experimenting:
http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-4.html
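To give an idea of what processing the archived data with Spark looks like, here is a minimal PySpark sketch; the HDFS path, file format, and column names are invented, and it assumes the historic data has already been exported to HDFS as CSV files with a header row.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("historic-report").getOrCreate()

    # Read the archived CSV files straight from HDFS (made-up path and columns).
    orders = spark.read.csv("hdfs:///archive/orders/", header=True, inferSchema=True)

    # A simple batch aggregation over the archive: order count and total amount per month.
    report = (orders
              .withColumn("month", F.date_format("created_at", "yyyy-MM"))
              .groupBy("month")
              .agg(F.count(F.lit(1)).alias("orders"), F.sum("amount").alias("total")))

    report.show()
    # Optionally persist the summary back to HDFS for later use.
    report.write.mode("overwrite").parquet("hdfs:///archive/reports/orders_by_month")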

Using Hadoop to perform DML operations on large fixed-format files

We have a product that uses a MySQL database as the data-store. The data-store holds a large amount of data. The problem we are facing is that the response time of the application is very slow. The database queries are very basic, with very simple joins, if any. According to some senior employees, the root cause of the slow response time is the database operations on the huge data-store.
Another team in our company worked on a project in the past where they processed large fixed-format files using Hadoop and dumped the contents of these files into database tables. Borrowing from this project, some of the team members feel that we can migrate from the MySQL database to simple fixed-format files that will hold the data instead, with one file corresponding to each table in the database. We can then build another data-interaction layer that provides interfaces for performing DML operations on the contents of these files. This layer will be developed using Hadoop and the MapReduce programming model.
At this point, several questions come to my mind.
1. Does the problem statement fit into the kind of problems that are solved using Hadoop?
2. How will the application ask the data interaction layer to fetch/update/delete the required data? As far as my understanding goes, the files containing the data will reside on HDFS. We will spawn a Hadoop job that will process the required file (similar to a table in the DB) and fetch the required data. This data will be written to an output file on HDFS. We will have to parse this file to get the required content.
3. Will the approach of using fixed-format files and processing them with Hadoop truly solve the problem?
I have managed to set up a simple cluster with two Ubuntu machines, but after playing around with Hadoop for a while, I feel that the problem statement is not a good fit for Hadoop. I could be completely wrong, and therefore want to know whether Hadoop fits into this scenario, or whether it is just a waste of time because the problem is not in line with what Hadoop is meant for.
I would suggest going straight to Hive (http://hive.apache.org/). It is a SQL engine / data warehouse built on top of Hadoop MapReduce.
In a nutshell, it gets Hadoop's scalability and also Hadoop's high latency.
I would consider storing the bulk of the data there, doing all the required transformations, and moving only the summarized data to MySQL to serve queries. It is usually not a good idea to translate user requests into Hive queries: they are too slow, and the capability to run jobs in parallel is not trivial.
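A sketch of that "summarize in Hive, serve from MySQL" pattern, using the PyHive and mysql-connector-python client libraries as one option among several; hosts, tables, and columns are made up.

    from pyhive import hive          # pip install pyhive
    import mysql.connector           # pip install mysql-connector-python

    # 1) Run the heavy aggregation in Hive over the bulk data (slow, batch-style).
    hive_conn = hive.connect(host="hive-server", port=10000, database="archive")
    hcur = hive_conn.cursor()
    hcur.execute("""
        SELECT to_date(created_at) AS day, COUNT(*) AS orders, SUM(amount) AS total
        FROM orders
        GROUP BY to_date(created_at)
    """)
    summary = hcur.fetchall()

    # 2) Load only the small summarized result into MySQL, which serves user queries.
    my = mysql.connector.connect(host="mysql-host", user="app",
                                 password="secret", database="reports")
    mcur = my.cursor()
    mcur.executemany(
        "REPLACE INTO daily_orders (day, orders, total) VALUES (%s, %s, %s)",
        summary)
    my.commit()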
If you are planning to update data often, then storing it directly in Hadoop may not be a good option for you. To update a file in Hadoop you may have to rewrite the file, delete the old one, and copy the new file into HDFS.
However, if you are just searching and joining the data, then it is a good option. If you use Hive, you can write SQL-like queries.
In Hadoop your workflow could look something like this (a rough sketch follows below):
You run a Hadoop job for your queries.
Your Hadoop program parses the query and executes jobs that join and read files based on your query and input parameters.
The output is generated in HDFS.
You copy the output to the local file system and then show it to your program.
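Driven from a script, that workflow might look roughly like this; the query file, paths, and delimiter handling are illustrative, and it assumes the hive and hdfs command-line tools are available on the machine.

    import subprocess

    # 1) Run the Hive job. The HiveQL in report.hql would end with something like:
    #      INSERT OVERWRITE DIRECTORY '/tmp/report' SELECT ... ;
    subprocess.check_call(["hive", "-f", "report.hql"])

    # 2) Copy the output directory from HDFS to the local file system.
    subprocess.check_call(["hdfs", "dfs", "-get", "/tmp/report", "./report"])

    # 3) Parse the local output files (part files are typically named like 000000_0;
    #    Hive's default field delimiter for directory output is Ctrl-A, i.e. \x01)
    #    and hand the rows to the application.
    with open("./report/000000_0") as f:
        rows = [line.rstrip("\n").split("\x01") for line in f]
    print(rows[:5])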

Hadoop - is it recommended only for a distributed environment?

I have a database whose size could go up to 1 TB in a month. If I query it directly, it takes a long time. So I was thinking of using Hadoop on top of the database; most of the time my query would involve searching the entire database. I would have either 1 or 2 database instances, not more than that. After a while we purge the database.
So can we use the Hadoop framework, since it helps with processing large amounts of data?
Hadoop is not "something you query" but you can use it to process a large amount of data and create a search index which you then load into a system you can query.
You can also look into HBase if you want a store for big data. In addition to HBase there are a number of other key-value or non-relational (NoSQL) stores that work well with large data.
A proper answer depends on the kind of query you are running. Are you always running a specific query? If so, then a key-value store works well; just choose the right keys. If your query needs to search the entire database as you say, and you only make one query every hour or two, then yes, in principle, you could write a simple "query" in Hive that will read from your HDFS store.
Note that querying in Hive only saves you time versus an RDBMS or a simple grep when you have a lot of data and access to a decent-sized cluster. If you only have one machine, it's a non-solution.
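To make the "specific query with the right keys" point concrete, here is a tiny sketch against HBase using the happybase client (one option among several); the table, column family, and row-key scheme are all invented.

    import happybase

    # Assumed: an HBase Thrift server and a table created beforehand, e.g. in the
    # HBase shell:  create 'events', 'd'
    # Row keys of the form "<user_id>:<yyyymmdd>" make the one query we always run
    # ("events for a user on a day") a direct key lookup or a narrow range scan.
    conn = happybase.Connection("hbase-thrift-host")
    events = conn.table("events")

    # Point lookup by key: no scan over the whole data set is needed.
    print(events.row(b"42:20150301"))

    # Or a narrow scan over one user's range of days.
    for key, data in events.scan(row_start=b"42:20150101", row_stop=b"42:20150401"):
        print(key, data)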
Hadoop works better on a distributed system. Moreover, 1 TB is not big data; for this, your relational database will do the job.
The real power of Hadoop comes when you have to process 100 TB or more of data, where relational databases fail.
If you look into HBase, it is fast, but it is not a substitute for your MySQL or Oracle.

Database that consumes less disk space

I'm looking at solutions to store a massive quantity of information while consuming the least possible disk space.
The information structure is very simple and the queries will also be very simple.
I've looked at solutions like Apache Cassandra and relational databases, but couldn't find a comparison that mentions disk usage.
Any ideas on this would be great.
Speaking of Apache Cassandra: it's just a disk-space hog. 200 MB of logs resulted in 1.2 GB of files produced by Cassandra, and the keyspace was just 4 columns with 200-character strings.
Take a look at Oracle Berkeley DB, a very simple, robust key/value database:
"Berkeley DB enables the development of custom data management solutions, without the overhead traditionally associated with such custom projects. Berkeley DB provides a collection of well-proven building-block technologies that can be configured to address any application need from the handheld device to the datacenter, from a local storage solution to a world-wide distributed one, from kilobytes to petabytes."
Redis might be worth a look if you can store your data as key-value pairs.
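If the data really does fit a key-value shape, storing each record as a small hash is the usual pattern; a sketch with the redis-py client (key scheme and fields are made up):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Store one record as a hash under an invented "doc:<id>" key scheme
    # (the mapping keyword needs redis-py 3.5+).
    r.hset("doc:1001", mapping={"title": "invoice-2015-03", "size_kb": "84"})

    # Fetch it back.
    print(r.hgetall("doc:1001"))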
Microsoft SQL Server 2008 and later support several levels of compression (row compression and page compression, in addition to backup compression). It might be worth investigating.
Some relevant resources:
Linchi Shea shows that compression can sometimes improve performance
Official MS Best Practices doc for SQL 2008 compression
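A sketch of estimating and then enabling page compression on one table, run here through pyodbc but equally doable from Management Studio; the connection string and table name are made up, and note that data compression was an Enterprise-edition feature in SQL Server 2008.

    import pyodbc

    # Placeholder connection string; point it at the database in question.
    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=sql-host;DATABASE=MassStore;UID=app;PWD=secret")
    cur = conn.cursor()

    # 1) Ask SQL Server how much space PAGE compression would save on a
    #    hypothetical dbo.Records table before changing anything.
    cur.execute("""
    EXEC sp_estimate_data_compression_savings
         @schema_name = 'dbo',
         @object_name = 'Records',
         @index_id = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';
    """)
    for row in cur.fetchall():
        print(row)

    # 2) If the estimate looks good, rebuild the table with page compression.
    cur.execute("ALTER TABLE dbo.Records REBUILD WITH (DATA_COMPRESSION = PAGE);")
    conn.commit()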
