Storing and processing high data volume - database

Good day!
I have 350 GB of unstructured data spread across 50-80 columns.
I need to store this data in a NoSQL database and run a variety of selection and map/reduce queries filtered by 40 of those columns.
I would like to use MongoDB, so I have a question: can this database cope with the task, and what do I need to do to architect it within my existing provider, hetzner.de?

Yes, MongoDB can handle datasets of that size.
Perhaps Apache Hadoop is also worth a look; it is aimed at processing and analyzing very large amounts of data.

MongoDB is a very scalable and flexible database if used properly. It can store as much data as you need, but the bottom line is whether you can query your data efficiently.
A few comments:
You will need to make sure you have the proper indexes in place and that a fair amount of them can fit in RAM.
To achieve that, you may need to use sharding to split the working set across machines.
The current map/reduce is easy to use and can iterate over all your data, but it is rather slow. It should become faster in the next MongoDB release, which will also add a new aggregation framework to complement map/reduce.
In short, you should not treat MongoDB as a magical store that will be perfect out of the box; make sure you read the docs and other good material :)
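As a rough illustration of the aggregation framework mentioned above, here is a pure-Python sketch of what a typical pipeline computes (the field names "region", "category", and "amount" are made up for illustration; with pymongo you would pass the same pipeline document to collection.aggregate()):

```python
# Hypothetical documents; in MongoDB these would live in a collection.
docs = [
    {"region": "EU", "category": "a", "amount": 10},
    {"region": "EU", "category": "b", "amount": 5},
    {"region": "US", "category": "a", "amount": 7},
    {"region": "EU", "category": "a", "amount": 3},
]

# The pipeline as you would send it to MongoDB: filter, then group+sum.
pipeline = [
    {"$match": {"region": "EU"}},
    {"$group": {"_id": "$category", "total": {"$sum": "$amount"}}},
]

# $match stage: keep documents whose fields equal the filter values.
matched = [d for d in docs
           if all(d.get(k) == v for k, v in pipeline[0]["$match"].items())]

# $group stage: sum "amount" per "category".
totals = {}
for d in matched:
    totals[d["category"]] = totals.get(d["category"], 0) + d["amount"]

print(totals)  # {'a': 13, 'b': 5}
```

The point is that this kind of filter-and-group query is expressed declaratively and runs server-side, instead of shipping a JavaScript map/reduce job over all documents.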

Related

What software should I use for graph distributed storing and processing?

Problem in a nutshell:
There's a huge amount of input data in JSON format. Like right now it's about 1 Tb, but it's going to grow. I was told that we're going to have a cluster.
I need to process this data, make a graph out of it and store it in a database. So every time I get a new JSON, I have to traverse the whole graph in a database to complete it.
Later I'm going to have a thin client in a browser, where I'm going to visualize some parts of the graph, search in it, traverse it, do some filtering, etc. So this system is not high load, just a lot of processing and data.
I have no experience in distributed systems, NoSQL databases and other "big data"-like stuff. During my little research I found out that there are too many of them and right now I'm just lost.
What I've got on my whiteboard at the moment:
1. Apache Spark's GraphX (GraphFrames) for distributed computing on top of some storage (HDFS, Cassandra, HBase, ...) and resource manager (YARN, Mesos, Kubernetes, ...).
2. Some graph database. I think it's good to use a graph query language like Cypher in Neo4j or Gremlin in JanusGraph/TitanDB. Neo4j is good, but it has clustering only in EE and I need something open source, so now I'm leaning toward the latter ones, which ship with Gremlin + Cassandra + Elasticsearch by default.
3. Maybe I don't need any of these and can just store the graph as an adjacency matrix in some RDBMS like Postgres, and that's it.
I don't know whether I need Spark in option 2 or 3. Do I need it at all?
My chief told me to check out Elasticsearch. But I guess I can use it only as an additional full-text search engine.
Thanks for any reply!
Let us start with a couple of follow-up questions:
1 TB is not a huge amount of data if that is also (close to) the total amount of data. Is it? How much new data are you expecting, and at what rate will it arrive?
Why would you have to traverse the whole graph if each JSON merely refers to a small part of it? It's either new data or an update of existing data (which you should be able to pinpoint), isn't it?
Yes, that's how you use a graph database ...
The rest sort of depends on your answer to 1). If we're talking about IoT-scale numbers of arriving events (tens of thousands per second, sustained) you might need a big data solution. If not, your main problem is getting the initial load done, and it's easy sailing from there ;-).
Hope this helps.
Regards,
Tom
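For the third whiteboard option (plain RDBMS adjacency storage), a minimal sketch using the standard-library sqlite3 module shows how far an edge list plus a recursive CTE gets you before reaching for a graph database or Spark. The `edges` table and node names are made up for illustration:

```python
import sqlite3

# Store the graph as an adjacency (edge) list in an ordinary table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])

# Traverse the graph in SQL: all nodes reachable from 'a'.
reachable = [row[0] for row in con.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
""")]
print(reachable)  # ['a', 'b', 'c', 'd']
```

Postgres supports the same WITH RECURSIVE syntax, so this design scales up to a real server; the question is whether your traversal patterns stay simple enough for SQL.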

What is the most effective method for handling large-scale dynamic data for a recommendation system?

We are designing a recommendation system based on large-scale data and are looking for a professional way to keep a dynamic DB structure that supports fast access. We are considering several alternative approaches. One is to keep the data in a normal SQL database, but that would be slower compared to a plain file structure. The second is to use a NoSQL graph-model DB, but that is not compatible with the algorithms we use, since we continuously pull all the data into a matrix. The final approach we are considering is to keep the data in plain files, but then it is harder to track and watch changes, since there is no query method or editor. So the different methods each have their pros and cons. What would your choice be, and why?
I'm not sure why you mention "files" and "file structure" so many times, so maybe I'm missing something, but for efficient data processing you generally don't want to store things in raw files. It is expensive to read/write data on disk, and it's hard to find an efficient, flexible way to query files sitting in a file system.
I suppose I'd start with a product that already does recommendations:
http://mahout.apache.org/
You can pick from various algorithms to run on your data for producing recommendations.
If you want to do it yourself, maybe a hybrid approach would work? You could still use a graph database to represent relationships, but then each node/vertex could be a pointer to a document database or a relational database where a more "full" representation of the data would exist.
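The hybrid idea above can be sketched in a few lines: a lightweight graph holds only the relationships, and each vertex carries a key into a separate document store that holds the "full" record. Both stores are plain dicts here purely for illustration; in practice they could be a graph database and MongoDB or a relational table:

```python
# Hypothetical document store: full records, keyed by ID.
doc_store = {
    "u1": {"name": "Alice", "likes": ["jazz", "go"]},
    "u2": {"name": "Bob", "likes": ["jazz"]},
}

# Hypothetical graph: only relationships, as adjacency sets of IDs.
graph = {"u1": {"u2"}, "u2": set()}

def neighbor_docs(node):
    """Resolve a node's neighbors into their full documents."""
    return [doc_store[n] for n in graph[node]]

print(neighbor_docs("u1"))  # [{'name': 'Bob', 'likes': ['jazz']}]
```

The graph side stays small enough to traverse quickly (or pull into a matrix), while the bulky attributes live where they are cheap to store and update.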

Best practice to implement cache

I have to implement caching for a function that processes strings of varying lengths (a couple of bytes up to a few kilobytes). My intention is to use a database for this: basically one big table with input and output columns and an index on the input column. The cache would try to find the string in the input column and return the output column, probably one of the simplest database applications imaginable.
What database would be best for this application? A fully-featured database like mysql or a simple one like sqlite3? Or is there even a better way by not using a database?
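For reference, the table design described in the question takes only a few lines with the standard-library sqlite3 module (table and column names here are just illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# PRIMARY KEY on "input" already gives us the lookup index.
db.execute("CREATE TABLE cache (input TEXT PRIMARY KEY, output TEXT)")

def cached(s, compute):
    """Return the cached output for s, computing and storing it on a miss."""
    row = db.execute("SELECT output FROM cache WHERE input = ?", (s,)).fetchone()
    if row:
        return row[0]
    result = compute(s)
    db.execute("INSERT INTO cache VALUES (?, ?)", (s, result))
    return result

print(cached("abc", str.upper))  # computed: 'ABC'
print(cached("abc", str.upper))  # served from the cache: 'ABC'
```

Whether sqlite3 is fast enough depends on your hit rate and concurrency; the answers below argue for dedicated key-value stores instead.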
Key-value stores are made for this. I highly recommend Redis for this specific problem. It is a "key-value" store: it has no relations and no schemas; all it does is map keys to values, which sounds like just what you need.
Alternatives are MongoDB and CouchDB. Look around and see what suits you best. My recommendation stays with Redis, though.
Reading: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Joe has some good recommendations for data stores that are commonly used for caching. I would say Redis, Couchbase (not CouchDB though: it goes to disk fairly frequently and is not that fast in my experience), and just plain Memcached.
MongoDB can be used for caching, but I don't think it's quite as tuned for pure caching like something like Redis is. Mongo can hit the disk quite a bit.
Also I highly recommend using time to live (TTL) as your main caching strategy. Just give a value some time to expire and then re-populate it later. It is a very hard problem to pro-actively find all instances of some data in a cache and refresh it.

How to best store a large JSON document (2+ MB) in a database?

What's the best way to store large JSON files in a database? I know about CouchDB, but I'm pretty sure that won't support files of the size I'll be using.
I'm reluctant to just read them off of disk, because of the time required to read and then update them. The file is an array of ~30,000 elements, so I think storing each element separately in a traditional database would kill me when I try to select them all.
I have lots of documents in CouchDB that exceed 2 MB and it handles them fine. Those limits are outdated.
The only caveat is that the default JavaScript view server has a pretty slow JSON parser, so view generation can take a while with large documents. You can use my Python view server with a C-based JSON library (jsonlib2, simplejson, yajl) or use the built-in Erlang views, which don't even hit JSON serialization, and view generation will be plenty fast.
If you intend to access specific elements one (or several) at a time, there's no way around breaking the big JSON into traditional DB rows and columns.
If you'd like to access it in one shot, you can convert it to XML and store that in the DB (maybe even compressed - XMLs are highly compressible). Most DB engines support storing an XML object. You can then read it in one shot, and if needed, translate back to JSON, using forward-read approaches like SAX, or any other efficient XML-reading technology.
But as #therefromhere commented, you could always save it as one big string (I would again check if compressing it enhances anything).
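The "one big (compressed) string" option is a one-liner with the standard library: serialize the document, compress it with zlib, and store the blob in a single row. Like the XML mentioned above, repetitive JSON compresses very well (the document below is synthetic, mimicking the ~30,000-element array from the question):

```python
import json
import zlib

# Synthetic stand-in for the question's ~30,000-element array.
doc = {"items": [{"id": i, "name": f"item-{i}"} for i in range(30000)]}

raw = json.dumps(doc).encode("utf-8")
blob = zlib.compress(raw)      # this blob would go into a BLOB column
print(len(blob) < len(raw))    # True: repetitive JSON shrinks dramatically

restored = json.loads(zlib.decompress(blob))
print(restored == doc)         # True: the round trip is lossless
```

The trade-off is that you must read and rewrite the whole blob on every update, which is exactly the cost the question is worried about; compression only shrinks the I/O, it doesn't make partial updates possible.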
You don't really have a variety of choices here: you can cache them in RAM using something like memcached, or push them to disk, reading and writing them with a database (an RDBMS like PostgreSQL/MySQL or a document-oriented database like CouchDB). The only real alternative is a hybrid system that caches the most frequently accessed documents in memcached for reading, which is how a lot of sites operate.
2+ MB isn't a massive deal to a database, and provided you have plenty of RAM it will do an intelligent enough job of caching and using your RAM effectively. Do you have a frequency pattern of when and how often these documents are accessed, and how many users you have to serve?

What are the advantages of CouchDB vs an RDBMS

I've heard a lot about couchdb lately, and am confused about what it offers.
It's hard to explain all the differences in strict advantage/disadvantage form.
I would suggest playing with CouchDB a little yourself. The first thing you'll notice is that the learning curve during initial usage is totally inverted from RDBMS.
With RDBMS you spend a lot of up front time modeling your real world data to get it in to the Database. Once you've dealt with the modeling you can do all kinds of queries.
With CouchDB you just get all your data into JSON and store it in the DB in, literally, minutes. You don't need to do any normalization or anything like that, and the transport is HTTP, so you have plenty of client options.
Then you'll notice a big learning curve when writing map functions and learning how the key collation works and the queries against the views you write. Once you learn them, you'll start to see how views allow you to normalize the indexes while leaving the data un-normalized and "natural".
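To make the map-function idea concrete, here is a rough pure-Python imitation of what a CouchDB view does: the map function emits (key, value) pairs per document, and the view is the sorted (collated) list of those pairs, queryable by key range. The documents and field names are invented; in CouchDB the map function would be the JavaScript shown in the comment:

```python
docs = [
    {"_id": "1", "type": "post", "tags": ["db", "couch"]},
    {"_id": "2", "type": "post", "tags": ["db"]},
]

def map_fn(doc):
    # Like: function(doc) { doc.tags.forEach(t => emit(t, doc._id)); }
    for tag in doc.get("tags", []):
        yield (tag, doc["_id"])

# The "view": every emitted pair, sorted by key (CouchDB's collation
# order is more elaborate, but the principle is the same).
view = sorted(pair for d in docs for pair in map_fn(d))
print(view)  # [('couch', '1'), ('db', '1'), ('db', '2')]
```

This is the sense in which views "normalize the indexes while leaving the data un-normalized": the documents stay as-is, and the sorted emitted keys play the role a normalized index table would play in an RDBMS.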
CouchDB is a document-oriented database.
Wikipedia:
As opposed to Relational Databases, document-based databases do not store data in tables with uniform sized fields for each record. Instead, each record is stored as a document that has certain characteristics. Any number of fields of any length can be added to a document. Fields can also contain multiple pieces of data.
Advantages:
You don't waste space by leaving empty fields in documents (because they're not necessarily needed)
By providing a simple frontend for editing it is possible to quickly set up an application for maintaining data.
Fast and agile schema updates/changes
Map/reduce queries in a Turing-complete language of your choice (no more SQL)
Flexible Schema designs
Freeform Object Storage
Really really easy replication
Really Really easy Load-Balancing (soon)
Take a look here.
I think the quote that best answers you is:
Just as CouchDB is not always the right tool for the job, RDBMSs are also not always the right answer.
CouchDB is a disk hog because it doesn't update documents in place: it creates a new revision each time you update, so the space you save by not having empty fields is trumped by the revisions.
