Ganglia - RRD(round robin database) scalability - database

I've just come across RRD lately by trying out ganglia monitoring system. Ganglia stores the monitoring data in RRD. I am just wondering that, from scalability perspective, how RRD works ? What if I have potentially huge amount of data to store. Like the ganglia case, if I want to store all the historical monitoring statistics instead of just storing the data recently with a specific TTL, will RRD good enough to cope with that?
Can someone who used RRD share some experience on how does RRD scale, and how does it compare to RDBMS or even big table?

The built-in consolidation feature of rrdtool is configurable, so depending on your disk space there is no limit to the amount of high precision data you can store with rrdtool. also due to its design, rrdtool databases will never have to be vacuumed or otherwise maintained, so that you can grow the setup to staggering sizes. Obviously you need enough memory and fast disks for rrdtool to work with big data, but this is the same with any large data step.
Some people get confused about rrdtools abilities due to the fact that you can also run it on a tiny embedded system, and when these people start logging gigabytes worth of data on an old pc from the attic and find that it does not cope, they wonder ...

RRD is designed to automatically blur (average out) your data over time, such that total size of database stays roughly the same, even as new data continuously arrives.
So, it is only good if you want some historical data and are willing to lose precision over time.
In other words, you cannot really compare RRD to standard SQL databases or to Bigtable, because standard SQL and NoSQL databases all store data precisely - you will read exactly what was written.
With RRDtool, however, there is no such guarantee. But its speed makes it attractive solution for all kinds of monitoring solutions, where only most recent data matters.

Related

Where does Big Data go and how is it stored?

I'm trying to get to grips with Big Data, and mainly with how Big Data is managed.
I'm familiar with the traditional form of data management and data life cycle; e.g.:
Structured data collected (e.g. web form)
Data stored in tables in an RDBMS on a database server
Data cleaned and then ETL'd into a Data Warehouse
Data is analysed using OLAP cubes and various other BI tools/techniques
However, in the case of Big Data, I'm confused about the equivalent version of points 2 and 3, mainly because I'm unsure about whether or not every Big Data "solution" always involves the use of a NoSQL database to handle and store unstructured data, and also what the Big Data equivalent is of a Data Warehouse.
From what I've seen, in some cases NoSQL isn't always used and can be totally omitted - is this true?
To me, the Big Data life cycle goes something on the lines of this:
Data collected (structured/unstructured/semi)
Data stored in NoSQL database on a Big Data platform; e.g. HBase on MapR Hadoop distribution of servers.
Big Data analytic/data mining tools clean and analyse data
But I have a feeling that this isn't always the case, and point 3 may be totally wrong altogether. Can anyone shed some light on this?
When we talk about Big Data, we talk in most cases about huge amount of data that is many cases constantly written. Data can have a lot of variety as well. Think of a typical data source for Big Data as a machine in a production line that produces all the time sensor data on temperature, humidity, etc. Not the typical kind of data you would find in your DWH.
What would happen if you transform all this data to fit into a relational database? If you have worked with ETL a lot, you know that extracting from the source, transforming the data to fit into a schema and then to store it takes time and it is a bottle neck. Creating a schema is too slow. Also mostly this solution is to costly as you need expensive appliances to run your DWH. You would not want to fill it with sensor data.
You need fast writes on cheap hardware. With Big Data you store schemaless as first (often referred as unstructured data) on a distributed file system. This file system splits the huge data into blocks (typically around 128 MB) and distributes them in the cluster nodes. As the blocks get replicated, nodes can also go down.
If you are coming from the traditional DWH world, you are used to technologies that can work well with data that is well prepared and structured. Hadoop and co are good for looking for insights like the search for the needle in the hay stack. You gain the power to generate insights by parallelising data processing and you process huge amount of data.
Imagine you collected Terabytes of data and you want to run some analytical analysis on it (e.g. a clustering). If you had to run it on a single machine it would take hours. The key of big data systems is to parallelise execution in a shared nothing architecture. If you want to increase performance, you can add hardware to scale out horizontally. With that you speed up your search with a huge amount of data.
Looking at a modern Big Data stack, you have data storage. This can be Hadoop with a distributed file system such as HDFS or a similar file system. Then you have on top of it a resource manager that manages the access on the file system. Then again on top of it, you have a data processing engine such as Apache Spark that orchestrates the execution on the storage layer.
Again on the core engine for data processing, you have applications and frameworks such as machine learning APIs that allow you to find patterns within your data. You can run either unsupervised learning algorithms to detect structure (such as a clustering algorithm) or supervised machine learning algorithms to give some meaning to patterns in the data and to be able to predict outcomes (e.g. linear regression or random forests).
This is my Big Data in a nutshell for people who are experienced with traditional database systems.
Big data, simply put, is an umbrella term used to describe large quantities of structured and unstructured data that are collected by large organizations. Typically, the amounts of data are too large to be processed through traditional means, so state-of-the-art solutions utilizing embedded AI, machine learning, or real-time analytics engines must be deployed to handle it. Sometimes, the phrase "big data" is also used to describe tech fields that deal with data that has a large volume or velocity.
Big data can go into all sorts of systems and be stored in numerous ways, but it's often stored without structure first, and then it's turned into structured data clusters during the extract, transform, load (ETL) stage. This is the process of copying data from multiple sources into a single source or into a different context than it was stored in the original source. Most organizations that need to store and use big data sets will have an advanced data analytics solution. These platforms give you the ability to combine data from otherwise disparate systems into a single source of truth, where you can use all of your data to make the most informed decisions possible. Advanced solutions can even provide data visualizations for at a glance understanding of the information that was pulled, without the need to worry about the underlying data architecture.

What NoSQL database should I be using?

Ok, so I've been doing a bit of research into NoSQL databases, and they seem to be the right option for what I need. The problem is however, that a lot of these databases, if not most of them are reading to/writing from RAM, as opposed to disk. That's great when you have plenty of server resources or don't expect massive data blocks - but I think I should prepare for the worst.
What I expect to receive from these data sources is anywhere from 25KB to 150KB per query - yup - up to 150KB for a single key value. The average user will produce anywhere from 500 to 5000 of these keys and they can grow infinitely (but will probably stop somewhere in that 5000 range). If you quickly do the calculations (most of the data will be on the higher end of 25-150, so I'll use 100KB as an "average", most users will probably produce 2000-3000 queries): 100KB*3000 - that's 300MB per user! An insane amount of data when you start getting a decent userbase. So, ultimately I'll probably throw away most of the data in the queries so it is no more than 1KB or so, but that will still far surpass most RAM capabilities.
So I think what I'm looking for is a solution that will store data to disk, and cache objects in RAM.. But I'm open to all solutions! Let me know what you guys think. I would love to keep this thing running fast...
Edit:
Wording it slightly differently as to be useful to a passerby:
If one is looking to maximize performance but handle large dataloads in a NoSQL database, what would be the recommended NoSQL database? I would think it would be one which stores data to disk, but this can compromise performance significantly. Is there a "best of both worlds" solution out there? It is important to note I assume, that these records would not be modified once they were submitted, only read from (but maybe not even that often).
I've been looking into Redis for such a task, because it looks very clean to manage - however it runs entirely in RAM, thus requires small data blocks, or multiple servers running multiple instances at once.. Which is something I don't have access to.
First of all, I think when you say most you've seen store data in RAM, you refer to in memory Key/Value data stores like Redis or Memcached.
But there's more than that. Before closing the discussion on in-memory NoSQL options, I should say that you are right. Memory fills up quite easily and you would need tons of it, judging from your requirements. So in-memory options should be discarded (not they're not useful, but not not in this specific situation).
My proposal is MongoDb. Does what you need: stores data on disk, caches stuff in-memory (as much as it can).
However, you need some powerful data storage options (SSD is what you should think about) so it can handle your data throughput needs. I've tested Mongo, but with far less data.
I was looking for over 1 million elements collections, with value sizes ranging from 5Kb to 50Kb.
I was mostly interested in read speeds. I should also mention write speeds, which I tested, and must say that they are impressive. One million 20Kb inserts in a few minutes (on a small server - quad core, 8GB of RAM, VMware VM).
Getting back to read speeds, I was looking for semi-concurrent queries that would give me under 50msec read times for around 100 concurrent users.
With some help from the MongoDb team I managed to get close to those times, but then I got into something else and had to drop my research (temporarily, I hope to resume it soon). There are far more things to look into, as speeds for aggregates, map/reduce, etc.
I can say that query times on the server were super fast and all the overhead was added by BSON serialization/deserialization and transport over the network.
So, for you Mongo would be appropriate, but you have to back it up with some good hardware.
You should really install it and test it in your specific situation and draw your conclusions from your own tests.
If you're going to do it and your client is .NET, then you should use their official driver. Otherwise, there are plenty others listed here: http://www.mongodb.org/display/DOCS/Drivers.
A good intro on Mongo features and how to use them can be found here: http://www.mongodb.org/display/DOCS/Developer+Zone. Granted, their documentation is not as good as the one for RavenDb (another NOSQL solution I've tested, but not nearly as fast) but you can get good support here or on Google Groups.

Best Database for remote sensor data logging

I need to choose a Database for storing data remotely from a big number (thousands to tens of thousands) of sensors that would generate around one entry per minute each.
The said data needs to be queried in a variety of ways from counting data with certain characteristics for statistics to simple outputting for plotting.
I am looking around for the right tool, I started with MySQL but I feel like it lacks the scalability needed for this project, and this lead me to noSQL databases which I don't know much about.
Which Database, either relational or not would be a good choice?
Thanks.
There is usually no "best" database since they all involve trade-offs of one kind or another. Your question is also very vague because you don't say anything about your performance needs other than the number of inserts per minute (how much data per insert?) and that you need "scalability".
It also looks like a case of premature optimization because you say you "feel like [MySQL] lacks the scalability needed for this project", but it doesn't sound like you've run any tests to confirm whether this is a real problem. It's always better to get real data rather than base an important architectural decision on "feelings".
Here's a suggestion:
Write a simple test program that inserts 10,000 rows of sample data per minute
Run the program for a decent length of time (a few days or more) to generate a sizable chunk of test data
Run your queries to see if they meet your performance needs (which you haven't specified -- how fast do they need to be? how often will they run? how complex are they?)
You're testing at least two things here: whether your database can handle 10,000 inserts per minute and whether your queries will run quickly enough once you have a huge amount of data. With large datasets these will become competing priorities since you need indexes for fast queries, but indexes will start to slow down your inserts over time. At some point you'll need to think about data archival as well (or purging, if historical data isn't needed) both for performance and for practical reasons (finite storage space).
These will be concerns no matter what database you select. From what little you've told us about your retrieval needs ("counting data with certain characteristics" and "simple outputting for plotting") it sounds like any type of database will do. It may be that other concerns are more important, such as ease of development (what languages and tools are you using?), deployment, management, code maintainability, etc.
Since this is sensor data we're talking about, you may also want to look at a round robin database (RRD) such as RRDTool to see if that approach better serves your needs.
Found this question while googling for "database for sensor data"
One of very helpful search-results (along with this SO question) was this blog:
Actually I've started a similar project (http://reatha.de) but realized too late, that I'm using not the best technologies available. My approach was similar MySQL + PHP. Finally I realized that this is not scalable and stopped the project.
Additionally, a good starting point is looking at the list of data-bases in Heroku:
If they use one, then it should be not the worst one.
I hope this helps.
you can try to use Redis noSQL database

How do the newer database models achieve better scalability and performance as compared to a traditional RDBMS implementation?

We have
BigTable from Google,
Hadoop, actively contributed by Yahoo,
Dynamo from Amazon
all aiming towards one common goal - making data management as scalable as possible.
By scalability what I understand is that the cost of the usage should not go up drastically when the size of data increases.
RDBMS's are slow when the amount of data is large as the number of indirections invariable increases leading to more IO's.
How do these custom scalable friendly data management systems solve the problem?
This is a figure from this document explaining Google BigTable:
Looks the same to me. How is the ultra-scalability achieved?
The "traditional" SQL DBMS market really means a very small number of products, which have traditionally targeted business applications in a corporate setting. Massive shared-nothing scalability has not historically been a priority for those products or their customers. So it is natural that alternative products have emerged to support internet scale database applications.
This has nothing to do with the fact that these new products are not "Relational" DBMSs. The relational model can scale just as well as any other model. Arguably the relational model suits these types of massively scalable applications better than say, network (graph based) models. It's just that the SQL language has a lot of disadvantages and no-one has yet come up with suitable relational NOSQL (non-SQL) alternatives.
Speaking specifically to your question about Bigtable, the difference is that the heirarchy in the diagram above is all there is. Each Bigtable tabletserver is responsible for a set of tablets (contiguous row ranges from a table); the mapping from row range to tablet is maintained in the metadata table, while the mapping from tablet to tabletserver is maintained in the memory of the Bigtable master. Looking up a row, or range of rows, requires looking up the metadata entry (which will almost certainly be in memory on the server that hosts it), then using that to look up the actual row on the server responsible for it - resulting in only one, or a few disk seeks.
In a nutshell, the reason this scales well is because it's possible to throw more hardware at it: given enough resources, the metadata is always in memory, and thus there's no need to go to disk for it, only for the data (and not always for that, either!).
It's about using cheap comodity hardware to build a network/grid/cloud and spread the data and load (for example using map/reduce).
RDBMS databases seem to me like software being (originaly) designed to run on one supercomputer. You can use various hard drive arrays, DB clusters, but still..
The amount of data increased so there's one more reason to design new data storages with this in mind - scalability, high availability, terabytes of data.
Another thing - if you build a grid/cloud from cheap servers, it's fault tolerant because you store all data at three (?) different locations and at the same time it's cheap.
Back to your pictures - the first one is from one computer (typically), the second one from a network of computers.
One theoretical answer on scalability is at http://queue.acm.org/detail.cfm?id=1394128 - the ACID guarantees are expensive. See http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf for a counter-argument.
In fact just surviving power failures is expensive. Years ago now I compared MySQL against Oracle. MySQL was almost unbelieveably faster than Oracle, but we couldn't use it. MySQL of those days was built on top of Berkeley
DB, which was miles faster than Oracle's full blown log-based database, but if the power went off while Berkely DB based MySQL was running, it was a manual process to get the database consistent again when the power went back on, and you'ld probably lose recent updates for good.

Very large database, very small portion most being retrieved in real time

I have an interesting database problem. I have a DB that is 150GB in size. My memory buffer is 8GB.
Most of my data is rarely being retrieved, or mainly being retrieved by backend processes. I would very much prefer to keep them around because some features require them.
Some of it (namely some tables, and some identifiable parts of certain tables) are used very often in a user facing manner
How can I make sure that the latter is always being kept in memory? (there is more than enough space for these)
More info:
We are on Ruby on rails. The database is MYSQL, our tables are stored using INNODB. We are sharding the data across 2 partitions. Because we are sharding it, we store most of our data using JSON blobs, while indexing only the primary keys
Update 2
The tricky thing is that the data is actually being used for both backend processes as well as user facing features. But they are accessed far less often for the latter
Update 3
Some people are commenting than 8Gb is toy these days. I agree, but just increasing the size of the db is pure LAZINESS if there is a smarter, efficient solution
This is why we have Data Warehouses. Separate the two things into either (a) separate databases or (b) separate schema within one database.
Data that is current, for immediate access, being updated.
Data that is historical fact, for analysis, not being updated.
150Gb is not very big and a single database can handle your little bit of live data and your big bit of history.
Use a "periodic" ETL process to get things out of active database, denormalize into a star schema and load into the historical data warehouse.
If the number of columns used in the customer facing tables are small you can make indexes with all the columns being used in the queries. This doesn't mean that all the data stays in memory but it can make the queries much faster. Its trading space for response time.
This calls for memcached! I'd recommend using cache-money, a great ActiveRecord write-through caching library. The ngmoco branch has support for enabling caching per-model, so you could only cache those things you knew you wanted to keep in memory.
You could also do the caching by hand using $cache.set/get/expire calls in controller actions or model hooks.
With MySQL, proper use of the Query Cache will keep frequently queried data in memory. You can provide a hint to MySQL not to cache certain queries (e.g. from the backend processes) with the SQL_NO_CACHE keyword.
If the backend processes are accessing historical data, or accessing data for reporting purposes, certainly follow S. Lott's suggestion to create a separate data warehouse and query that instead. If a data warehouse is too much to accomplish in the short term, you can replicate your transactional database to a different server and perform queries there (a Data Warehouse gives you MUCH more flexibility and capability, so go down that path if possible)
UPDATE:
See documentation of SELECT and scroll down to SQL_NO_CACHE.
Read about the Query Cache
Ensure query_cache_type set appropriate for your needs.
UPDATE 2:
I confirmed with MySQL support that there is no mechanism to selectively cache certain tables etc. in the innodb buffer pool.
So, what is the problem?
First, 150gb is not very large today. It was 10 years ago.
Second any non-total-crap database system will utilize your memory as cache. If the cache is big enough (compared to the amount of data that is in use) it will be efficient. If not, the only thing you CAN do is get more memory (because, sorry, 8gb of memory is VERY low for a modern server - it was low 2 years ago).
You should not have to do anything for the memory to be efficiently used. At least not on a commercial level database - maybe mysql sucks, but I would not assume this.

Resources