NoSQL or RDBMS for audit data - database

I know that similar questions have been asked on this subject, but I still haven't seen one that completely covers all my requirements.
I should start by saying that I only have experience with RDBMSs, so I'm sorry if I get anything about NoSQL wrong.
I'm creating a database that would hold a large amount of audit logs (about 1TB).
I'm using it for:
Fast data writing (a massive amount of audit logs is written all the time)
Search - searching over the audit data (actions performed by a certain user, at a certain time, or of a certain type... the database should support searching any of the 'columns' very quickly)
Analytics & Reporting - generating daily, weekly and monthly reports from the data (they are predefined at the moment; if they become more dynamic, does that affect which solution I should choose?)
Reliability (support for fail-over or any similar feature), scalability (if I grow beyond 1 TB to 2 TB, 10 TB or 100 TB, are any of the solutions unable to handle that amount of data?) and of course performance (in the use cases I listed) are all very important to me.
I know RDBMS and that would be my easy way of starting, but I'm really concerned that after a while, the DB would simply not keep up with the pace.
My question is should I pick an RDBMS or NoSQL solution and why? If a NoSQL solution, since they are so different, which of them do you think fits my needs?

Generally there isn't a right or wrong answer here.
Fast data writing: either solution will be OK, although you didn't say what volume per second you are storing. Both have things to watch out for.
Search (very quick) over all columns: for smaller volumes, say a few hundred GB, either solution will be OK (assuming skilled people put it together). You didn't actually say how fast or how often you search; if it is many times per minute, this consideration becomes more important. Fast search can also slow down your ability to write high volumes quickly, because the indexes required for search have to be updated on every write.
Audit records typically have a time component, so a search that is time-constrained, e.g. within the last 7 days, will be significantly faster than a search over all records.
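As an illustration of how much an index plus a time bound helps, here is a minimal sketch using Python's built-in sqlite3; the table, columns and index names are made up, and the same idea applies to any RDBMS:

```python
import sqlite3

conn = sqlite3.connect("audit.db")

# Hypothetical audit table: one row per logged action.
conn.execute("""
    CREATE TABLE IF NOT EXISTS audit_log (
        id          INTEGER PRIMARY KEY,
        user_id     TEXT,
        action      TEXT,
        created_at  TEXT    -- ISO-8601 timestamp, e.g. 2024-01-31 12:00:00
    )
""")

# A composite index matching the common "what did this user do recently" search.
# Indexes like this speed up reads but add work to every insert.
conn.execute("""
    CREATE INDEX IF NOT EXISTS idx_audit_user_time
    ON audit_log (user_id, created_at)
""")

# Time-constrained search: the index narrows the scan to one user's last
# 7 days instead of touching every record ever written.
rows = conn.execute("""
    SELECT action, created_at
    FROM audit_log
    WHERE user_id = ?
      AND created_at >= datetime('now', '-7 days')
    ORDER BY created_at DESC
""", ("some_user",)).fetchall()
print(rows)
```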
Reporting: when you get up to 100 TB, you are going to need some real tricks, or a big budget, to get fast reporting. For static reports you will probably end up writing one program that generates multiple reports in a single pass to save I/O. Dynamic reports will be the tricky ones.
My opinion? Since you know RDBMS, I would start with that and ship the solution. This buys you time to learn the real problems you will encounter (the "no premature optimization" principle that many on SO are keen on). During this initial period you can start evaluating NoSQL solutions and learning them. I am assuming here that you want to run your own hardware/database; if you want to use cloud-type solutions, then go to them straight away.

Related

Alternative to scanning AWS DynamoDB?

I understand that scanning DynamoDB is not recommended and is bad practice.
Let's say I have a food ordering website and I want to do a daily scan of all users to find out who hasn't ordered food in the last week so I can send them an email (just an example).
This would put some very spikey demand on the database, especially with a large user base.
Is there an alternative to these scheduled scans that I'm missing? Or in this scenario is a scan the best tool for the job?
There are a lot of possible answers to this question. As so often, it all comes down to the simple truth that the best way to do something like this depends on the actual specifics and on what you are trying to optimize for (cost, latency, duration, etc.).
Since this appears to be a scheduled batch job, I guess latency and job duration are not high on the priority list, but cost might be.
The next important thing to consider is implementation complexity. For example: if your service only has 100 users, I would not bother with any of the more complex solutions and just do a scan. But if your service has millions of users, this is probably not a great idea anymore.
For the purpose of this answer I am going to assume that your user base has become too large to just do a scan. In this scenario I can think of two possible solutions:
Add a separate index that allows you to "query" for the last order date easily.
Use an S3 backup
The first should be fairly self-explanatory. As often described in DynamoDB articles, you are supposed to define your "access patterns" and build indexes around them. The pro here is that you are still operating within DynamoDB; the con is the added cost.
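For example, the daily job could become a Query against a global secondary index instead of a full Scan. A rough boto3 sketch, assuming a hypothetical GSI named last-order-index whose partition key is a constant bucket attribute and whose sort key is the last order date (none of these names come from the question; they are placeholders):

```python
from datetime import datetime, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")  # placeholder table name

cutoff = (datetime.utcnow() - timedelta(days=7)).isoformat()

# Query the hypothetical GSI instead of scanning the whole table.
# "gsi_bucket" is a constant attribute written on every item so all users
# share one index partition; "last_order_date" is the index's sort key.
response = table.query(
    IndexName="last-order-index",
    KeyConditionExpression=(
        Key("gsi_bucket").eq("ALL") & Key("last_order_date").lt(cutoff)
    ),
)

for item in response["Items"]:
    print("needs reminder:", item["user_id"])  # hand off to your email job here
```

Note that a single constant partition key can become a hot partition as the user base grows; spreading users over a handful of bucket values and querying each bucket is a common refinement.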
My preferred solution is probably to just do scheduled backups of the table to S3 and then process the backup somewhere else, either with a custom tool you write or with an AWS service that can process large amounts of data. This is probably the cheapest solution, but processing time won't be "super fast".
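A sketch of that route using DynamoDB's native export to S3 (the table ARN and bucket are placeholders, and the feature requires point-in-time recovery to be enabled on the table):

```python
import boto3

client = boto3.client("dynamodb")

# Kick off a full-table export to S3. Exports are served from the
# point-in-time-recovery data, so they don't consume the table's read capacity.
response = client.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/users",
    S3Bucket="my-analytics-bucket",
    S3Prefix="exports/users/",
    ExportFormat="DYNAMODB_JSON",
)

print(response["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS
```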
I am looking forward to other solutions to this interesting question.

How to implement sharding?

First world problems: we've got a production system that is growing rapidly, and we are aiming to grow our user base even more. At peak times our DB is pegged at 100% CPU, which I take as an indication that it's pretty much stretched to its limit. Being on an AWS instance, we could always throw more hardware at it, but long term it seems we will need to implement sharding.
I've Googled all over and found lots of explanations of what sharding is, why it is a good idea under certain circumstances, what the design considerations are, etc. ... but not a word on the practicalities of actually doing it.
What are the practical steps to shard a database? How do you redirect queries to the appropriate shard? And how do you run reports that require data from all shards?
The first thing you'll want to decide is whether or not you want to take on the complexity of routing queries in your application. If you decide to roll your own implementation, there are a number of complexities that you'll need to deal with over time.
You'll need a scheme to distribute data and queries evenly across the cluster. You'll also need to make sure this scheme is forward-compatible with a larger cluster, because if your data is already big enough to require a sharded architecture, it's likely that you'll need to add more servers later.
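One common scheme is to hash the shard key into a fixed number of logical buckets that is much larger than the number of physical servers, so buckets can later be reassigned to new servers without re-hashing every row. A minimal sketch (bucket count, hostnames and key names are arbitrary):

```python
import hashlib

NUM_BUCKETS = 1024  # fixed for the lifetime of the cluster

# Map of logical buckets to physical shards; grow the cluster by
# reassigning ranges of buckets to new servers, not by changing the hash.
BUCKET_TO_SHARD = {
    range(0, 512): "db-shard-1.internal",
    range(512, 1024): "db-shard-2.internal",
}

def bucket_for(user_id: str) -> int:
    """Stable hash of the shard key into a logical bucket."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def shard_for(user_id: str) -> str:
    """Resolve which physical server holds this user's data."""
    bucket = bucket_for(user_id)
    for bucket_range, host in BUCKET_TO_SHARD.items():
        if bucket in bucket_range:
            return host
    raise RuntimeError("unmapped bucket %d" % bucket)

print(shard_for("user-42"))  # e.g. db-shard-1.internal
```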
The problem with sharding schemes is that they force you to make tradeoffs that you wouldn't have to make with a single-server database. For example, if you are sharding by user_id, any query which spans multiple users will need to be sent to all servers (or a subset of servers) and the results must be accumulated in your client application. This is especially complex if you are using aggregate queries that rely on the ordering of the data, such as MAX(), or any histogram computation.
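A cross-shard aggregate then becomes a scatter-gather in the application: run the same query on every shard and combine the partial results client-side. A sketch for MAX() (the per-shard query below is faked with a dict so the example runs standalone; in practice it would be a real driver call against each shard's connection):

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["db-shard-1.internal", "db-shard-2.internal"]

def max_order_total_on_shard(host: str):
    # Stand-in for a real driver call, e.g.
    # SELECT MAX(total) FROM orders WHERE created_at >= ...
    fake_results = {"db-shard-1.internal": 199.99, "db-shard-2.internal": 540.00}
    return fake_results.get(host)

# Scatter: run the same aggregate on every shard in parallel.
with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    partial_maxima = list(pool.map(max_order_total_on_shard, SHARDS))

# Gather: the MAX of the per-shard maxima is the global maximum. Aggregates
# such as AVG or percentiles need more than one number back from each shard.
global_max = max(m for m in partial_maxima if m is not None)
print(global_max)  # 540.0
```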
All of this complexity isn't meant to scare you, but it's something you'll need to pay attention to. There are tools out there that can help you (disclosure: my company makes a tool called dbShards) but you can definitely put together your own solution, especially if your application is mature and the query patterns are quite predictable.

What NoSQL database should I be using?

OK, so I've been doing a bit of research into NoSQL databases, and they seem to be the right option for what I need. The problem, however, is that a lot of these databases, if not most of them, read from and write to RAM as opposed to disk. That's great when you have plenty of server resources or don't expect massive data blocks - but I think I should prepare for the worst.
What I expect to receive from these data sources is anywhere from 25 KB to 150 KB per query - yes, up to 150 KB for a single key's value. The average user will produce anywhere from 500 to 5,000 of these keys, and they can grow indefinitely (but will probably stop somewhere in that 5,000 range). If you quickly do the math (most of the data will be at the higher end of 25-150, so I'll use 100 KB as an "average", and most users will probably produce 2,000-3,000 keys): 100 KB * 3,000 = 300 MB per user! An insane amount of data once you get a decent user base. Ultimately I'll probably throw away most of the data in each query so it is no more than 1 KB or so, but even that will far surpass most RAM capacities.
So I think what I'm looking for is a solution that stores data on disk and caches objects in RAM... but I'm open to all solutions! Let me know what you guys think. I would love to keep this thing running fast...
Edit:
Wording it slightly differently so as to be useful to a passerby:
If one is looking to maximize performance while handling large data loads in a NoSQL database, what would be the recommended NoSQL database? I would think it would be one that stores data on disk, but that can compromise performance significantly. Is there a "best of both worlds" solution out there? It is important to note, I assume, that these records would not be modified once they were submitted, only read from (and maybe not even that often).
I've been looking into Redis for this task because it looks very clean to manage - however, it runs entirely in RAM and thus requires small data blocks, or multiple servers running multiple instances at once, which is something I don't have access to.
First of all, I think that when you say most of the ones you've seen store data in RAM, you are referring to in-memory key/value stores like Redis or Memcached.
But there's more to NoSQL than that. Before closing the discussion on in-memory options, I should say that you are right: memory fills up quite easily, and judging from your requirements you would need tons of it. So the in-memory options should be discarded (not because they aren't useful, they're just not right for this specific situation).
My proposal is MongoDB. It does what you need: it stores data on disk and caches as much as it can in memory.
However, you will need some serious storage hardware (think SSDs) for it to handle your throughput needs. I've tested Mongo, but with far less data.
I was working with collections of over 1 million documents, with value sizes ranging from 5 KB to 50 KB.
I was mostly interested in read speeds, but I also tested write speeds, and I must say they are impressive: one million 20 KB inserts in a few minutes (on a small server - quad core, 8 GB of RAM, VMware VM).
For reads, I was looking for semi-concurrent queries that would give me sub-50 ms read times for around 100 concurrent users.
With some help from the MongoDB team I managed to get close to those times, but then I got into something else and had to drop my research (temporarily; I hope to resume it soon). There are far more things to look into, such as speeds for aggregates, map/reduce, etc.
I can say that query times on the server were super fast and all the overhead was added by BSON serialization/deserialization and transport over the network.
So, for you Mongo would be appropriate, but you have to back it up with some good hardware.
You should really install it and test it in your specific situation and draw your conclusions from your own tests.
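If you want a starting point for such a test, here is a minimal pymongo sketch; the database, collection, document shape and sizes are made up, and it assumes a MongoDB server listening on localhost:

```python
import os
import time

from pymongo import ASCENDING, MongoClient

collection = MongoClient("mongodb://localhost:27017")["benchmark"]["blobs"]

# Index the lookup keys so reads don't degrade as the collection grows.
collection.create_index([("user_id", ASCENDING), ("key", ASCENDING)])

# Write test: batches of documents carrying a ~20 KB payload each.
payload = os.urandom(10 * 1024).hex()  # ~20 KB of hex text
start = time.time()
for batch in range(100):
    docs = [{"user_id": "user-%d" % batch, "key": i, "value": payload}
            for i in range(1000)]
    collection.insert_many(docs)
print("inserted 100,000 docs in %.1fs" % (time.time() - start))

# Read test: fetch one user's documents through the index.
start = time.time()
docs = list(collection.find({"user_id": "user-42"}))
print("read %d docs in %.3fs" % (len(docs), time.time() - start))
```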
If you're going to do it and your client is .NET, then you should use the official driver. Otherwise, there are plenty of others listed here: http://www.mongodb.org/display/DOCS/Drivers.
A good intro to Mongo's features and how to use them can be found here: http://www.mongodb.org/display/DOCS/Developer+Zone. Granted, the documentation is not as good as RavenDb's (another NoSQL solution I've tested, but not nearly as fast), but you can get good support here or on Google Groups.

Best Database for remote sensor data logging

I need to choose a database for storing data remotely from a large number (thousands to tens of thousands) of sensors, each of which generates around one entry per minute.
The data needs to be queried in a variety of ways, from counting entries with certain characteristics for statistics to simply outputting it for plotting.
I'm looking around for the right tool. I started with MySQL, but I feel like it lacks the scalability needed for this project, which led me to NoSQL databases, which I don't know much about.
Which database, relational or not, would be a good choice?
Thanks.
There is usually no "best" database since they all involve trade-offs of one kind or another. Your question is also very vague because you don't say anything about your performance needs other than the number of inserts per minute (how much data per insert?) and that you need "scalability".
It also looks like a case of premature optimization because you say you "feel like [MySQL] lacks the scalability needed for this project", but it doesn't sound like you've run any tests to confirm whether this is a real problem. It's always better to get real data rather than base an important architectural decision on "feelings".
Here's a suggestion:
Write a simple test program that inserts 10,000 rows of sample data per minute (a rough sketch follows this list)
Run the program for a decent length of time (a few days or more) to generate a sizable chunk of test data
Run your queries to see if they meet your performance needs (which you haven't specified -- how fast do they need to be? how often will they run? how complex are they?)
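Here is that test program sketched against Python's built-in sqlite3 so it runs anywhere; swap in your MySQL driver and your real schema to make the numbers meaningful (the table layout below is invented):

```python
import random
import sqlite3
import time

conn = sqlite3.connect("sensor_test.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id   INTEGER,
        recorded_at REAL,
        value       REAL
    )
""")

# Simulate 10,000 sensors each reporting once per minute.
while True:
    start = time.time()
    rows = [(sensor_id, start, random.random() * 100)
            for sensor_id in range(10_000)]
    with conn:  # one transaction per minute-sized batch
        conn.executemany(
            "INSERT INTO readings (sensor_id, recorded_at, value) VALUES (?, ?, ?)",
            rows,
        )
    print("inserted %d rows in %.2fs" % (len(rows), time.time() - start))
    time.sleep(max(0, 60 - (time.time() - start)))
```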
You're testing at least two things here: whether your database can handle 10,000 inserts per minute and whether your queries will run quickly enough once you have a huge amount of data. With large datasets these will become competing priorities since you need indexes for fast queries, but indexes will start to slow down your inserts over time. At some point you'll need to think about data archival as well (or purging, if historical data isn't needed) both for performance and for practical reasons (finite storage space).
These will be concerns no matter what database you select. From what little you've told us about your retrieval needs ("counting data with certain characteristics" and "simple outputting for plotting") it sounds like any type of database will do. It may be that other concerns are more important, such as ease of development (what languages and tools are you using?), deployment, management, code maintainability, etc.
Since this is sensor data we're talking about, you may also want to look at a round robin database (RRD) such as RRDTool to see if that approach better serves your needs.
I found this question while googling for "database for sensor data".
One of the most helpful search results (along with this SO question) was this blog:
I actually started a similar project (http://reatha.de) but realized too late that I was not using the best technologies available. My approach was similar: MySQL + PHP. Eventually I realized that it wasn't scalable and stopped the project.
Additionally, a good starting point is the list of databases available on Heroku:
If they offer one, it's probably not the worst choice.
I hope this helps.
You can try the Redis NoSQL database.

The difficulty of choosing the right database for analytics

I need some help deciding which database we should choose for our project. We are developing a web application that collects data about users' behavior and analyzes it (bad explanation, but I can't provide much more detail; web analytics data is one of our core datasets). We estimate that we will insert approximately 200 million rows per week into the database, plus data calculated from that raw data. The data must be retained for at least six months.
I have spent the last week and a half gathering information about different solutions, but there are so many that I feel lost. The most promising ones I've found are Cassandra, HBase and Hive. I also looked at MongoDB, Redis and some others, but they seemed to suit different needs, or their communities weren't as active.
The whole app will run on Amazon EC2. As a startup company, the pay-as-you-go pricing model fits us like a glove. The easier the database is to manage in the cloud, the better.
Scalability is important. The amount of data we generate varies quite a bit and will grow over time.
We can't pay huge licensing fees. Otherwise we would probably use something like http://www.vertica.com/.
We need to do all sorts of analysis on the data, and the easier the analyses are to write, the better. I thought about using Map/Reduce for the task; HBase seems to have better support for this than Cassandra, and Hive has its own query language. Real-time analysis isn't needed; we can calculate results once a day and shovel them back into the database for fast retrieval.
Compression support would be nice, but not necessary (disk space is cheap :).
I also thought about using MySQL (because we will use it for all the user information etc. anyway), but scaling would be much harder in the future, and I think at some point we would have to move to another DB anyway. We are also more than willing to commit some time and effort to pushing the selected database forward in terms of development.
We have decided to go with Hadoop (and Hive/HBase) as our primary data store. The main reasons for this are:
It is proven technology, and many big sites use it (Facebook...).
There is lots of documentation around, and entire books have been written about Hadoop.
Hive provides a nice SQL-like query language and a command line, so even people who don't know Java/Python/etc. can write queries easily (see the query sketch after this list).
It's free, and the community seems helpful :)
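To give a feel for what "SQL-like" means in practice, here is the kind of daily roll-up we have in mind, sketched as a HiveQL string run through the hive CLI (the table and column names are illustrative, not our real schema):

```python
import subprocess

# Hypothetical events table partitioned by day; the query rolls raw events
# up into one row per user per day for fast retrieval later.
daily_report = """
    INSERT OVERWRITE TABLE daily_user_stats PARTITION (dt = '2011-05-01')
    SELECT user_id,
           COUNT(*)                AS events,
           COUNT(DISTINCT page_id) AS distinct_pages
    FROM   events
    WHERE  dt = '2011-05-01'
    GROUP BY user_id
"""

# `hive -e` executes a query string; Hive compiles it down to MapReduce jobs.
subprocess.run(["hive", "-e", daily_report], check=True)
```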

Resources