I'm about to start a new project, basically a reporting tool, that will need a very large database.
The number of tables will not be large (<200), the majority of the data (80%) will be contained in 20 tables, and all data is almost entirely insert/read only (no updates).
The amount of data in that one table is estimated to grow by 240,000 records per minute (roughly 345 million rows a day), and we need to keep at least 1 to 3 years of it to run various reports, which an administrator will view online.
I don't have first-hand experience with databases that large, so I'm asking those who do which DB is the best choice in this situation. I know that Oracle is the safe bet, but I'm more interested in whether anyone has experience with alternatives such as HadoopDB or Google's BigTable.
Please guide me. Thanks in advance.
Oracle is going to get very expensive to scale up enough. MySQL will be hard to scale. It's not their fault; an RDBMS is overkill for this.
Let me start with a dumb question: what are you doing with this data? "Various reports" could be a lot of things. If these reports can be generated in bulk, offline, then why not keep your data in flat files on a shared file system?
If it needs to be more online, then yes the popular wisdom from the past 2 years is to look at NoSQL databases like Mongo, Couch and Cassandra. They're simpler, faster creatures that scale easily and provide more random access to your data.
Doing analytics on NoSQL is all the rage this year. For example, I'd look at what Acunu is doing to embed analytics into their flavor of Cassandra: http://www.acunu.com/blogs/andy-twigg/acunu-analytics-preview/
You can also use Apache Solr and MongoDB. Both are commonly used for handling big data in the NoSQL world, and they are very fast at inserting data into and retrieving it from the database. So you could go with Apache Solr or MongoDB.
Problem
Currently, I have my data in MariaDB, and there has been a recent push in my group to move things to Splunk. Everyone in my group is a DB novice; we can push things in and pull them out, but as far as making smart decisions about which kind of DB to use, it's the blind leading the blind. The biggest draw of Splunk is the ease of creating "dashboards" that could help us use the data more effectively. I'm trying to understand whether it's better to switch everything into Splunk or just use the Splunk DBX (DB Connect) thing to get the benefits of the easy dashboards while keeping the MariaDB the same.
My Data
There are 2 separate databases that do different things, but they are set up almost identically, so I'll just talk about one of them; when I talk about the amount of traffic, I'll use the numbers from the one with the most traffic.
There is 1 table that contains a list of tools and information on those tools.
There are 3 other tables that log the usage of those tools, using a foreign key (I think!) to point to the tools table, and usage is tracked per user per week. There are 3 different tables because there are 3 different tool types for which I track different kinds of information, but they all have a setup similar to this one.
Example:

id | tool_id | user | week | usages
---+---------+------+------+-------
1  | 5       | Joe  | w5   | 5
4  | 5       | Joe  | w6   | 3
Each time a tool is used, I update the row by incrementing the usages column.
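Roughly speaking, the write looks like this (a sketch assuming a table called tool_usage with a unique key on (tool_id, user, week); the table name, key, and connection details here are just for illustration, not my actual schema):

    import mysql.connector

    conn = mysql.connector.connect(
        host="db.example.com", user="app", password="secret", database="tools"
    )

    def record_usage(tool_id, user, week):
        # Insert a usage row, or bump the counter if one already exists.
        sql = (
            "INSERT INTO tool_usage (tool_id, user, week, usages) "
            "VALUES (%s, %s, %s, 1) "
            "ON DUPLICATE KEY UPDATE usages = usages + 1"
        )
        cur = conn.cursor()
        cur.execute(sql, (tool_id, user, week))
        conn.commit()
        cur.close()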
Expected DB Traffic
Adding/updating data happens sometimes thousands of times an hour; so far the in-house DBaaS has handled the traffic without a problem.
For querying, I run queries over essentially the entire DB maybe 10 times a day, if that, BUT every time a tool is used I query the tools table to figure out which usage-tracking table to use, then do the add/update on the appropriate table.
Limitations
I'm limited to MariaDB, PostgreSQL, and Splunk databases at this time. I have minimal Splunk experience and no PostgreSQL experience. I'm most experienced in MariaDB but that just means I can understand the documentation and follow it to make more elaborate queries.
Questions
Should I stay with MariaDB and connect it to Splunk for easier querying, dashboard creation, more real-time data, and better analysis of the data, OR move entirely to Splunk?
Is there another option I don't know about?
Please also feel free to offer any other advice you may have!
Why Can't I Figure This Out Myself?
I've tried for several days now, but everything I can find is too vague in its descriptions. To me a DB that is ten thousand rows long is large, but apparently a real DB expert would laugh at me for thinking that. So I can't tell whether, at the size of my data, it doesn't even matter what I do, or whether it's extremely important.
Splunk is not a database, so it won't replace MariaDB (or any other database program). The main reason is that data in Splunk is immutable, so the DB concept of updating a row has no equivalent in Splunk.
Consider using Splunk's DB Connect plug-in to connect it to your MariaDB. That gives Splunk the ability to pull data from the database so it can be analyzed and/or visualized.
Also consider other visualization tools like Tableau.
I ask this question apprehensively because it is not a pure programming question, and because I am seeking a (well informed) suggestion.
I have an analytic front end, written in JavaScript, with lots of aggregations and charting happening in the browser (dimple.js, even stats.js, ...)
I want to feed this application with JSON or delimited data from some high-performance data-structure server. No writes except for loading. The data will be maybe 1-5 GB in size, and there could be dozens, if not hundreds, of concurrent readers, but only at peak hours. This data is collected from and fed by Apache Hive.
Now my question is about the selection of a database/datastore server choices for this.
(I have pretty good command of SQL/NoSQL choices, so I am really seeking advice for the very specific requirements)
Requirements and specifications for this datastore are:
Mostly if not all queries will be reads, initiated by the web, JS-based front end.
Data can be served as JSON or flat tabular csv, psv, tsv.
Total data size on this store will be 1-5 GB, with possible future growth, but nothing imminent (6-12 months)
Data on this datastore will be refreshed/loaded daily, probably never in real time.
Data will/can be accessed via some RESTful web services, Socket IO, etc.
The faster the read access, the better; speed matters.
There has to be a security/authentication method for sensitive data protection.
It needs to be reasonably stable, not a patching-requiring bleeding edge.
Liberal, open source license.
So far, my initial candidates for examination were Postgres (optimized for large cache) and Mongo. Just because I know them pretty well.
I am also familiar with Redis, Couch.
I did not benchmark them myself, but I have seen benchmarks where Postgres was faster than Mongo (while offering a JSON format). Mongo is web-friendlier.
I am considering in-memory stores with persistence such as Redis, Aerospike, Memcached. Redis 3.0 is my favorite so far.
So, I ask you here if you have any recommendations for the production quality datastore that would fit well what I need.
Any civil and informed suggestions are welcome.
What exactly does your data look like? Since you said CSV like exports, I'm assuming this is tabular, structured data that would usually be found in a relational database?
Some options:
1. Don't use a database
Given the small dataset, just serve it out of memory. You can probably spend a few hours writing a quick app with any decent web framework that loads the data into memory (for example, from a flat file) and then searches and returns it in whatever format and way you need.
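For example, a minimal sketch of that approach with Flask (the file name, the /data endpoint, and the filtering behaviour are all assumptions for illustration, not requirements from your post):

    import csv
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Load the whole export into memory once at startup; a few GB fits in RAM.
    with open("hive_export.csv", newline="") as f:
        ROWS = list(csv.DictReader(f))

    @app.route("/data")
    def data():
        # Optional filtering on any column via the query string, e.g. /data?region=EU
        rows = ROWS
        for key, value in request.args.items():
            rows = [r for r in rows if r.get(key) == value]
        return jsonify(rows)

    if __name__ == "__main__":
        app.run()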
2. Use an embedded database
You can also try an embedded database like SQLite, which gives you in-memory performance but with a reliable SQL interface. Since it's just a single-file database, you can have another process generate a new DB file, then swap it out when you update the data for the app.
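A rough sketch of that swap pattern (the paths and table layout are assumptions; the key point is building the new file on the side and replacing the old one atomically):

    import os
    import sqlite3

    def rebuild(rows, tmp_path="analytics.db.new", live_path="analytics.db"):
        # Build tomorrow's database in a separate file...
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        con = sqlite3.connect(tmp_path)
        con.execute("CREATE TABLE metrics (day TEXT, metric TEXT, value REAL)")
        con.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()
        # ...then swap it in; os.replace is atomic on POSIX filesystems.
        os.replace(tmp_path, live_path)

    def query(sql, params=()):
        # Open per request (cheap for SQLite) so readers pick up the new file.
        with sqlite3.connect("analytics.db") as con:
            return con.execute(sql, params).fetchall()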
3. Use a full database system
Use a regular relational database. MySQL, PostgreSQL, and SQL Server (Express Edition) are all free, can handle that dataset easily, and will just cache it all in RAM. If it's mostly read queries, I don't see any issues with a few hundred concurrent users. You can also use MemSQL Community Edition if you need more performance. They all support security, are very reliable, and you can't beat SQL for data access.
Use a key/value system if your data isn't relational or tabular and is more of a fit as simple values or documents. However, remember that KV stores aren't great at scans or aggregations and don't have joins. Memcached is just a distributed cache; don't use it for real data. Redis and Aerospike are both great key/value systems, with Redis giving you lots of nice data structures to use. Mongo is good for data flexibility. Elasticsearch is a good option for advanced search-like queries.
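To make the key/value route concrete, a small sketch with Redis where the daily load stores a pre-aggregated JSON blob per report (the key naming and the "report" concept are my own illustration, not part of your spec):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def load_daily(report_name, rows):
        # Called once after the daily Hive export: store the whole payload as JSON.
        r.set("report:" + report_name, json.dumps(rows))

    def get_report(report_name):
        # Read path used by the web service layer; returns None if missing.
        payload = r.get("report:" + report_name)
        return json.loads(payload) if payload is not None else None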
If you go with one of these database systems, though, you will still need a thin app layer somewhere to interface with the database and then return the data in the proper format for your frontend.
If you want to skip that part, then just use CouchDB or Riak instead. Both are document-oriented and have a native HTTP interface with JSON responses, so you can consume them directly from your frontend, although this might cause security issues since anyone can see the JavaScript calls.
I am currently working on a long term project that will need to support:
Lots of fast Read/Write operations via RESTful Services
An Analytics Engine continually reading and making sense of data
It is vital that the performance of the Analytics Engine not be affected by the volume of Reads/Writes coming from the API calls.
Because of that, I'm thinking that I may have to use a "front-end" database and some sort of "back-end" data warehouse. I would also need to have something like Elastic Search or Solr indexing the data stored in the data warehouse.
The Questions:
Is this a Recommended Setup? What would the alternative be?
If so...
I'm considering either Hive or Pig for the data-warehousing, and Elastic Search or Solr as a Search Engine. Which combination is known to work better together?
And finally...
I'm seriously considering Cassandra as the "front-end" database. What is the relation between Cassandra and Hadoop, and when/why should they be put to work together instead of having just Cassandra?
Please note, my intention is NOT to start a debate about which of these is better, but to understand how they can be put to work together more efficiently. If it makes any difference, the main code is being written in Scala and Java.
I truly appreciate your help. I'm basically learning as I go and all comments will be very helpful.
Thank you.
First let's talk about Cassandra
This is a NoSQL database with eventual consistency, which basically means that different nodes in a Cassandra cluster may have different 'snapshots' of the data if there is a communication/availability problem between nodes. The data will eventually become consistent, however.
Since you are considering it as a 'front-end' database, what you need to understand is how you will model your data. Cassandra can take advantage of indexes; however, you still need to define your access patterns up front.
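To illustrate what "define your access patterns up front" means in practice, here is a sketch with the Python driver: the table is laid out so that the one query you need (latest events for an entity) hits a single partition. The keyspace, table, and column names are invented for the example, not taken from your project.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("analytics")

    # One table per query: partition by the thing you look up, cluster by time.
    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_entity (
            entity_id  text,
            event_time timestamp,
            payload    text,
            PRIMARY KEY (entity_id, event_time)
        ) WITH CLUSTERING ORDER BY (event_time DESC)
    """)

    # The partition key must appear in the WHERE clause -- that is the access
    # pattern you committed to when you designed the table.
    rows = session.execute(
        "SELECT * FROM events_by_entity WHERE entity_id = %s LIMIT 100",
        ("user-42",),
    )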
Normally there is no relation between Cassandra and Hadoop (except that both are written in Java); however, the DataStax distribution (enterprise version) has Hadoop support directly from Cassandra.
As a general workflow, you would read/write the most current data (let's say the last 24 hours) from your 'small' database, which has enough performance for that (Cassandra supports this very well), and you would move anything older than X (older than 24 hours) to 'long-term storage' such as Hadoop, where you can run all sorts of MapReduce jobs, etc.
In regard to text search, it really depends on what you need - Elasticsearch and Solr are essentially competitors. You can see for yourself how they compare here: http://solr-vs-elasticsearch.com/
As for your third question, I think of Cassandra mainly as a database for storing data, while Hadoop provides a computation model that lets you analyze the large amount of data held in Cassandra. So it is very helpful to combine Cassandra with Hadoop.
There are other combinations you can consider as well, such as MongoDB with Hadoop, since Mongo has the mongo-connector for moving data between Hadoop and its collections. And if you have search requirements, you can also use Solr and generate the index directly from Mongo.
I am making a music app with social networking features. I was hoping to power it with Neo4j and Redis: in Neo4j I will store user info, and all other information (posts, reviews, etc.) will go in Redis. Does anyone have any advice or insight on this?
Short answer: it depends.
Longer answer:
I'm assuming that you are just starting with the app and want quick feedback on whether it is something you want to invest (time/money) in.
If you want to run queries like "which users reviewed the same song", you need to put this data into Neo4j. In general, the more connected data you have there, the more interesting the questions you can answer. So I would err on the side of putting data into Neo4j. Also, querying only one database is easier to implement than aggregating data over multiple ones.
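As a sketch of the kind of query that only works if the reviews live in Neo4j as relationships (the labels, relationship type, and property names below are assumptions, not something from your model):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    def users_who_reviewed_same_songs(username):
        # Find other users who reviewed any song this user also reviewed.
        query = """
            MATCH (me:User {name: $name})-[:REVIEWED]->(s:Song)<-[:REVIEWED]-(other:User)
            RETURN DISTINCT other.name AS user, s.title AS song
        """
        with driver.session() as session:
            return [dict(record) for record in session.run(query, name=username)]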
If you get enough users that the amount of data they produce starts to impact Neo4j, you can put the actual review text or post into Redis and reference it by an id from Neo4j. But by then you will already know it is worth doing, and it is a fairly manageable refactoring and data migration.
Neo4j is a graph database; however, it does not support sharding (horizontal partitioning). The good thing about using Neo4j is that you can store a graph data structure and run graph algorithms easily with the Neo4j query language, which may be useful for analyzing some social network properties. The bad thing is that, because Neo4j does not support sharding, the capacity of the database is limited to a single node, and as the data size increases, its performance may suffer.
Redis is always useful for caching data, so it can be a good choice for that. IMHO, in the same situation I would try to store everything in Neo4j.
I need some help deciding which database we should choose for our project. We are developing a web application that collects data about users' behavior and analyzes it (bad explanation, but I can't provide much more detail; web analytics data is one of our core datasets). We have estimated that we will insert approx. 200 million rows per week into the database, plus data calculated from that raw data. The data must be retained for at least six months.
I have spent the last week and a half gathering information about different solutions, but there seem to be so many that I feel lost. The most promising ones I found are Cassandra, HBase and Hive. I also looked at MongoDB, Redis and some others, but they looked like they suited different needs, or the community wasn't that active.
The whole app will run on Amazon EC2. As a startup company, the pay-as-you-go pricing model fits us like a glove. The easier the database is to manage in the cloud, the better.
Scalability is important. The amount of data we will generate varies quite a lot and will grow over time.
We can't pay huge licensing fees. Otherwise we would probably use something like http://www.vertica.com/.
We need to do all sorts of analysis on the data, and the easier they are to write, the better. I thought about using MapReduce for the task; HBase seems to have better support for this than Cassandra, and Hive has its own query language. Real-time analysis isn't needed; we can calculate results once a day and shovel them back into the database for fast retrieval.
Compression support would be nice, but not necessary (disk space is cheap :).
I also thought about using MySQL (because we will use it for all the user information etc. anyway), but scaling will be much harder in the future, and I think at some point we would have to move to some other DB anyway. We are also more than willing to commit some time and effort to pushing the selected database forward in terms of development.
We have decided to go with Hadoop (and Hive/HBase) as our primary data store. The main reasons for this are:
It is proven technology, and many big sites are using it (Facebook...).
Lots of documentation around, and Hadoop books have even been written.
Hive provides a nice SQL-like query language and command line, so even people who don't know Java/Python/etc. can write queries easily (a small example follows this list).
It's free, and the community people seem to be helpful :)
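For a taste of what those queries look like from Python, here is a rough sketch (the events table, its columns and partitioning, and the PyHive connection details are illustrative assumptions, not our actual schema):

    from pyhive import hive

    conn = hive.connect(host="hive.example.com", port=10000)
    cursor = conn.cursor()

    # Plain SQL-like HiveQL: daily counts per event type, compiled to MapReduce under the hood.
    cursor.execute("""
        SELECT dt, event_type, COUNT(*) AS events
        FROM events
        WHERE dt = '2011-06-01'
        GROUP BY dt, event_type
    """)
    for row in cursor.fetchall():
        print(row)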