Sorry if there's already an answer for this; I searched and didn't find exactly my scenario.
Once again, this is a question like "Which DB has the fastest/best performance?". But since the answer depends on the scenario, here is mine: I want to write many logs to a DB, thousands per second, but I will not read them often. In fact, 99.99% of them will never be read again, but once in a while I will need to read some. The schema is not complex, just key/value. Once in a while I will read by value, and I will not care at all if that read takes minutes. The correctness of the read is critical, but not its performance.
So far it seems the best options are things like MongoDB and Cassandra... and perhaps DynamoDB is the best fit?
With almost any DBMS, I would say: switch to the lowest isolation level and use no indexes. Put that together with a good storage system, maybe RAID 0 with SSDs, and you will get very fast writes.
It's hard to say which DBMS is best; usually you want the DBMS that is best at doing one thing in particular, and what you need is one that basically just writes data with the fewest restrictions possible. I've heard MySQL can be great at this.
If fast writes are what you're after, you have a few options. Assuming you will be the one maintaining the DB, you can buffer the inserts in memory and flush them once the buffer reaches a certain size; that way you aren't hitting the disk as often.
If I'm not mistaken, MongoDB does that already; plus, if you disable journaling, it can drastically increase write performance, which is exactly what you're going for.
Either way, caching and bulk inserting is the way to go with any database.
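To illustrate the buffering idea, here is a minimal sketch. sqlite3 is used only to keep the example self-contained; the table and column names are made up, and the same buffer-then-executemany pattern applies to whichever driver you actually use:

    import sqlite3

    # Buffer log entries in memory and flush them to the database in bulk
    # once a threshold is reached, instead of touching the disk per entry.
    class BufferedLogWriter:
        def __init__(self, conn, flush_size=1000):
            self.conn = conn
            self.flush_size = flush_size
            self.buffer = []

        def write(self, key, value):
            self.buffer.append((key, value))
            if len(self.buffer) >= self.flush_size:
                self.flush()

        def flush(self):
            if not self.buffer:
                return
            with self.conn:  # one transaction per batch
                self.conn.executemany(
                    "INSERT INTO logs (key, value) VALUES (?, ?)", self.buffer
                )
            self.buffer.clear()

    if __name__ == "__main__":
        conn = sqlite3.connect("logs.db")
        conn.execute("CREATE TABLE IF NOT EXISTS logs (key TEXT, value TEXT)")
        writer = BufferedLogWriter(conn, flush_size=500)
        for i in range(10_000):
            writer.write("key-%d" % i, "value-%d" % i)
        writer.flush()  # flush whatever is left in the buffer
        print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])

The trade-off is that anything still sitting in the buffer is lost if the process crashes, which may be acceptable for the "logs I will almost never read" scenario above.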
Related
I have a single machine for setting up a database. The main purpose of that database is to store an audit log, and the write/read ratio is about 9:1. Originally I planned to use MySQL. Later I heard that NoSQL DBs have better performance when they "scale out". I am interested: if I only have one single machine, will a NoSQL DB (e.g., Cassandra or MongoDB) have better performance than MySQL?
Yes, it does.
I am using MongoDB. I tried my application on both MySQL and MongoDB, and I see better performance with MongoDB.
So, in general, I have noticed better performance.
But I think it depends on certain factors, such as whether you are using joins, foreign keys, etc. As you probably know, in NoSQL there are usually no joins.
PS: Having said all this, it depends on how and what you are doing.
Also, if you are using MongoDB as your NoSQL database, check out the newer version (3.0), as they say it is 10 times faster than the older version (2.6).
And if you really need a high-performance database, you can go for Redis.
It's a "NoSQL" key-value data store. More precisely, it is a data structure server. Not like MongoDB (which is a disk-based document store), though MongoDB could be used for similar key/value use cases. The closest analog is probably to think of Redis as Memcached, but with built-in persistence (snapshotting or journaling to disk) and more datatypes.
You will need to read a bit about it to make a decision on what is important to you.
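As a rough illustration of the key/value style described above, here is a minimal sketch using the redis-py client; the host/port, key names and payloads are made up for the example:

    import redis  # pip install redis

    # Connect to a local Redis instance (connection details are assumptions).
    r = redis.Redis(host="localhost", port=6379, db=0)

    # A plain key/value write: one SET per log entry.
    r.set("log:1001", "user=alice action=login")

    # For thousands of writes per second, a pipeline queues commands on the
    # client and sends them together, avoiding one network round trip per write.
    pipe = r.pipeline()
    for i in range(10_000):
        pipe.set("log:%d" % i, "payload-%d" % i)
    pipe.execute()

    # The rare read path: an exact-key lookup.
    print(r.get("log:1001"))

Keep in mind that Redis holds the whole dataset in RAM; the persistence (snapshotting or the append-only journal) is for durability, not for working sets larger than memory.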
It really depends on your usage scenario. In general, NoSQL solutions tend to provide simpler semantics which, in theory at least, can avoid some of the overhead of ACID-compliant SQL solutions. In practice, though, real-world performance can be affected by so many factors that the theory rarely comes into play.
On a single box, you are probably better off spending time on optimizing your code to work with whatever platform you choose.
There are "textbook" solutions for dealing with log data - ElasticSearch/Kibana/LogStash is a common one, Redis is another - are usually focused on larger scale systems and don't fit your single-machine requirement.
Depending on how much you are willing to invest, you may be able to squeeze the most IOPS out of your box when using a really thin storage mechanism such as ESENT or BerkeleyDB. Word of warning though: those are really more painful to work with.
In summary, I'll repeat myself: On a single box, you are probably better off spending time on optimizing your code to work with whatever platform you choose.
One of the in-memory NoSQL databases with persistence is the best choice for your scenario. Their write operations are extremely fast because, under the hood, a write is just an append to the end of the write-ahead log file, which can be done at speeds of up to 100 MB/sec even on magnetic disks.
Try Redis or Tarantool. The latter is better if you have a heavy concurrent write load.
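The append-only point is easy to see with a plain file. A rough, illustrative sketch (record size, count, and file name are arbitrary) that times sequential appends on your own disk:

    import os
    import time

    RECORD = b"x" * 200     # a small key/value-sized record (arbitrary size)
    COUNT = 500_000         # number of records to append

    start = time.perf_counter()
    with open("wal_test.log", "ab") as f:
        for _ in range(COUNT):
            f.write(RECORD)          # sequential append, no seeks
        f.flush()
        os.fsync(f.fileno())         # force the data to disk before stopping the clock
    elapsed = time.perf_counter() - start

    mb = COUNT * len(RECORD) / (1024 * 1024)
    print("appended %.1f MB in %.2fs (%.1f MB/s)" % (mb, elapsed, mb / elapsed))

A real write-ahead log adds framing, checksums and group commit, but the underlying cost is the same kind of sequential append.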
OK, so I've been doing a bit of research into NoSQL databases, and they seem to be the right option for what I need. The problem, however, is that a lot of these databases, if not most of them, read from and write to RAM, as opposed to disk. That's great when you have plenty of server resources or don't expect massive data blocks, but I think I should prepare for the worst.
What I expect to receive from these data sources is anywhere from 25KB to 150KB per query - yup - up to 150KB for a single key value. The average user will produce anywhere from 500 to 5000 of these keys and they can grow infinitely (but will probably stop somewhere in that 5000 range). If you quickly do the calculations (most of the data will be on the higher end of 25-150, so I'll use 100KB as an "average", most users will probably produce 2000-3000 queries): 100KB*3000 - that's 300MB per user! An insane amount of data when you start getting a decent userbase. So, ultimately I'll probably throw away most of the data in the queries so it is no more than 1KB or so, but that will still far surpass most RAM capabilities.
So I think what I'm looking for is a solution that will store data to disk, and cache objects in RAM.. But I'm open to all solutions! Let me know what you guys think. I would love to keep this thing running fast...
Edit:
Wording it slightly differently so as to be useful to a passerby:
If one is looking to maximize performance but handle large data loads in a NoSQL database, what would be the recommended NoSQL database? I would think it would be one which stores data to disk, but this can compromise performance significantly. Is there a "best of both worlds" solution out there? It is important to note, I assume, that these records would not be modified once they were submitted, only read from (and maybe not even that often).
I've been looking into Redis for such a task, because it looks very clean to manage; however, it runs entirely in RAM, and thus requires small data blocks, or multiple servers running multiple instances at once, which is something I don't have access to.
First of all, I think that when you say most of the ones you've seen store data in RAM, you are referring to in-memory key/value data stores like Redis or Memcached.
But there's more than that. Before closing the discussion on in-memory NoSQL options, I should say that you are right. Memory fills up quite easily and you would need tons of it, judging from your requirements. So in-memory options should be discarded (not that they're not useful, just not in this specific situation).
My proposal is MongoDB. It does what you need: it stores data on disk and caches as much as it can in memory.
However, you need some powerful data storage options (SSD is what you should think about) so it can handle your data throughput needs. I've tested Mongo, but with far less data.
I was looking at collections of over 1 million elements, with value sizes ranging from 5KB to 50KB.
I was mostly interested in read speeds. I should also mention write speeds, which I tested and must say are impressive: one million 20KB inserts in a few minutes (on a small server: quad core, 8GB of RAM, VMware VM).
Getting back to read speeds, I was looking for semi-concurrent queries that would give me under 50msec read times for around 100 concurrent users.
With some help from the MongoDB team I managed to get close to those times, but then I got into something else and had to drop my research (temporarily; I hope to resume it soon). There are far more things to look into, such as speeds for aggregates, map/reduce, etc.
I can say that query times on the server were super fast and all the overhead was added by BSON serialization/deserialization and transport over the network.
So, for you Mongo would be appropriate, but you have to back it up with some good hardware.
You should really install it and test it in your specific situation and draw your conclusions from your own tests.
If you're going to do it and your client is .NET, then you should use their official driver. Otherwise, there are plenty of others listed here: http://www.mongodb.org/display/DOCS/Drivers.
A good intro on Mongo features and how to use them can be found here: http://www.mongodb.org/display/DOCS/Developer+Zone. Granted, their documentation is not as good as the one for RavenDb (another NOSQL solution I've tested, but not nearly as fast) but you can get good support here or on Google Groups.
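For a quick feel of the insert path being discussed, here is a minimal sketch using the Python driver (pymongo) instead of the .NET one mentioned above; the connection string, database and collection names, and document sizes are made up:

    from pymongo import MongoClient, WriteConcern  # pip install pymongo

    # Connection string and database/collection names are assumptions.
    client = MongoClient("mongodb://localhost:27017")
    db = client["benchmark"]

    # insert_many sends documents to the server in batches, which is much
    # faster than calling insert_one once per document.
    docs = [{"key": i, "value": "x" * 20_000} for i in range(1_000)]  # ~20KB values
    db["items"].insert_many(docs)

    # For fire-and-forget logging, a relaxed write concern (w=0) trades
    # durability acknowledgement for throughput.
    fast = db["items"].with_options(write_concern=WriteConcern(w=0))
    fast.insert_many([{"key": -i, "value": "y"} for i in range(1_000)])

    # The occasional read path: an exact-match lookup.
    print(db["items"].find_one({"key": 42})["key"])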
I need to choose a database for storing data remotely from a large number (thousands to tens of thousands) of sensors that would each generate around one entry per minute.
The data needs to be queried in a variety of ways, from counting entries with certain characteristics for statistics to simply outputting it for plotting.
I am looking around for the right tool. I started with MySQL, but I feel like it lacks the scalability needed for this project, and this led me to NoSQL databases, which I don't know much about.
Which Database, either relational or not would be a good choice?
Thanks.
There is usually no "best" database since they all involve trade-offs of one kind or another. Your question is also very vague because you don't say anything about your performance needs other than the number of inserts per minute (how much data per insert?) and that you need "scalability".
It also looks like a case of premature optimization because you say you "feel like [MySQL] lacks the scalability needed for this project", but it doesn't sound like you've run any tests to confirm whether this is a real problem. It's always better to get real data rather than base an important architectural decision on "feelings".
Here's a suggestion:
Write a simple test program that inserts 10,000 rows of sample data per minute (a rough sketch follows this list)
Run the program for a decent length of time (a few days or more) to generate a sizable chunk of test data
Run your queries to see if they meet your performance needs (which you haven't specified -- how fast do they need to be? how often will they run? how complex are they?)
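A minimal sketch of such a load generator is below. sqlite3 stands in only to keep it dependency-free; swap in the driver for whichever database you are actually evaluating, and treat the schema (sensor_id, ts, value) as a placeholder:

    import sqlite3
    import time

    ROWS_PER_MINUTE = 10_000
    BATCH = 1_000  # insert in batches rather than one row at a time

    conn = sqlite3.connect("sensor_test.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id INTEGER, ts REAL, value REAL)"
    )

    while True:  # leave it running for a few days to accumulate test data
        minute_start = time.time()
        for _ in range(ROWS_PER_MINUTE // BATCH):
            rows = [(i % 10_000, time.time(), float(i)) for i in range(BATCH)]
            with conn:  # one transaction per batch
                conn.executemany(
                    "INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)",
                    rows,
                )
        elapsed = time.time() - minute_start
        print("inserted %d rows in %.1fs" % (ROWS_PER_MINUTE, elapsed))
        time.sleep(max(0, 60 - elapsed))  # pace the load to one set of batches per minute

Once it has run for a while, point your candidate queries at the accumulated table and time them.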
You're testing at least two things here: whether your database can handle 10,000 inserts per minute and whether your queries will run quickly enough once you have a huge amount of data. With large datasets these will become competing priorities since you need indexes for fast queries, but indexes will start to slow down your inserts over time. At some point you'll need to think about data archival as well (or purging, if historical data isn't needed) both for performance and for practical reasons (finite storage space).
These will be concerns no matter what database you select. From what little you've told us about your retrieval needs ("counting data with certain characteristics" and "simple outputting for plotting") it sounds like any type of database will do. It may be that other concerns are more important, such as ease of development (what languages and tools are you using?), deployment, management, code maintainability, etc.
Since this is sensor data we're talking about, you may also want to look at a round robin database (RRD) such as RRDTool to see if that approach better serves your needs.
Found this question while googling for "database for sensor data"
One of the very helpful search results (along with this SO question) was this blog:
Actually, I started a similar project (http://reatha.de) but realized too late that I was not using the best technologies available. My approach was similar: MySQL + PHP. Eventually I realized that this was not scalable and stopped the project.
Additionally, a good starting point is looking at the list of databases in Heroku:
If they use one, it should not be the worst one.
I hope this helps.
You can try the Redis NoSQL database.
Are there any general guidelines online on how to tune Oracle for a high number of inserts and a low number of reads?
All the answers below are pretty good recommendations. I have to clarify a few things: I am using 10g, and it is an absolute requirement that we use Oracle. I am also more interested in Oracle instance parameters for tuning (perhaps some different locking policies).
Let me assume you want to do an extremely high number of inserts, to the point where you simply want to ignore all other kinds of operations just to get those inserts to complete without problems.
First, have you completely ruled out other types of databases? There are systems, such as industrial databases, that cope very well with massive amounts of inserts, typically used to receive and store data from equipment measuring something in a factory environment. Oracle is a relational database; it might not be the right type of software for your needs.
Having said that, let's assume you can, or will, or should, use Oracle. The very first thing you need to do is consider all the various types of data you will be storing. If it is all the same kind of data, you need one table, and it needs to be lean and mean for inserts.
The optimal way to do that is the following:
Do not add any indexes on this table at all; if you need a primary key, that should be the only index.
If you need to do reads against this table, consider having a shadow table with indexes that you run reads, lookups, and aggregates against. If it doesn't have to be up to date to the millisecond, use a periodic batch job to update it with data from the master table (a sketch appears at the end of this answer). This will disturb the master table with read locks as little as possible.
Make sure your server has fast disks. Transactional write operations will typically involve the disk at some point, so make that as small a bottleneck as you can.
If your application is gathering data from many incoming sources, consider adding a layer in front of the database that will keep the number of concurrent connections, and thus transactions against that table, to a minimum. If you get a high number of write locks on the same block in an Oracle database, your performance will ultimately suffer.
If you can split up the data, consider splitting it in such a way that it is stored on different physical disks. That way, disk I/O problems won't be cross-data-type, and only affect one type of data.
At the other end of the spectrum you have a denormalized table with lots of indexes, optimized for a balance between lookups and updates; you need to find some middle way that gives you the performance you want.
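Here is a rough sketch of the batched-insert plus shadow-table idea, assuming the cx_Oracle driver; the connection details, table names (raw_log, raw_log_shadow) and columns are made up for illustration:

    import cx_Oracle  # pip install cx_Oracle (connection details are assumptions)

    conn = cx_Oracle.connect(user="app", password="secret", dsn="dbhost/orcl")
    cur = conn.cursor()

    # Batch the inserts: one round trip per batch instead of one per row.
    # The target is the lean master table with no indexes beyond the PK.
    rows = [(i, "payload-%d" % i) for i in range(10_000)]
    cur.executemany("INSERT INTO raw_log (id, payload) VALUES (:1, :2)", rows)
    conn.commit()

    # Periodic batch job: refresh the indexed shadow table used for reads,
    # so lookups and aggregates never touch the insert-optimized master table.
    cur.execute("""
        INSERT INTO raw_log_shadow (id, payload)
        SELECT id, payload FROM raw_log
        WHERE id > (SELECT NVL(MAX(id), 0) FROM raw_log_shadow)
    """)
    conn.commit()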
In terms of database design, put as few constraints, indexes, and triggers on the table(s) you're inserting into as possible, as these will all slow down inserts.
The lack of indexes will obviously hurt your SELECT performance, but it doesn't sound like this is your primary concern.
What sort of application are we talking about? What version of Oracle?
If you are designing a data warehouse load process, for example, you would generally want to do direct-path inserts into staging table(s), then build any necessary indexes, then do a partition exchange to load the data into the partitioned destination table. This doesn't work as well, of course, if you are doing single-row inserts.
Depending on the Oracle version and the type of application, you may also want to enable compression on the table. Inserts are generally cheap from a CPU standpoint, so there is probably plenty of CPU available to do the compression which can substantially decrease the amount of I/O required, which is generally going to be your bottleneck.
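A rough sketch of that load pattern, again assuming a Python client (cx_Oracle) and made-up table and partition names; the exact DDL depends on how the destination table and its local indexes are defined:

    import cx_Oracle  # connection details, tables and partition names are assumptions

    conn = cx_Oracle.connect(user="etl", password="secret", dsn="dbhost/orcl")
    cur = conn.cursor()

    # 1. Direct-path insert into an unindexed staging table. The APPEND hint
    #    loads blocks above the high-water mark, bypassing the buffer cache.
    cur.execute("""
        INSERT /*+ APPEND */ INTO stage_sales (sale_id, sale_date, amount)
        SELECT sale_id, sale_date, amount FROM ext_sales_feed
    """)
    conn.commit()  # a direct-path load must be committed before the table is queried again

    # 2. Build the indexes the destination partition needs, after the load.
    cur.execute("CREATE INDEX stage_sales_ix ON stage_sales (sale_date)")

    # 3. Swap the loaded segment into the partitioned destination table
    #    (assumes the staging table's indexes match the destination's local indexes).
    cur.execute("""
        ALTER TABLE sales EXCHANGE PARTITION p_latest WITH TABLE stage_sales
        INCLUDING INDEXES WITHOUT VALIDATION
    """)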
I'm going to suggest that you take your question to Tom Kyte's site, http://asktom.oracle.com. You can generally find an answer there. Otherwise, try Oracle's forums.
Also try looking at any of Tom Kyte's books. I suggest checking the library or your local bookstore to find the right one, to make sure it covers the topics you need. Also, his blog has links to his books and to some articles/discussions on each one.
I did a quick Google search for "site:oracle.com tuning write" and found this:
OracleAS TopLink Writing Optimization Features. I realize that you might not be using TopLink, but it may have some good tips. Keywords you'll want to try: tuning, performance, insert(s), improve. Also throw in the technology you are using, like Java/C++/etc.
Other tips you can try:
Using stored procedures, or using them in more efficient ways.
Tweaking your server's hardware: faster hard drives or a specific RAID array, possibly more CPUs.
Ask Tom thread - some nice comments here, also links to Fowler's site
You will probably have to start running some performance analytics on your queries/implementations to find the sweet spot for each one. I wish I had an easy answer for you. Good Luck!
A couple of suggestions for you to look into further:-
direct path load
block compression
In a web app that supports more than 5000 users, Postgres is becoming the bottleneck.
It takes more than 1 minute to add a new user (even after optimizations, and on Win 2k3).
So, as a design issue, which other DBs might be better?
Most likely, it's not PostgreSQL, it's your design. Changing shoes most likely will not make you a better dancer.
Do you know what is causing slowness? Is it contention, time to update indexes, seek times?
Are all 5000 users trying to write to the user table at the exact same time as you are trying to insert the 5001st user? That, I can believe, could cause a problem. You might have to go with something tuned for handling extreme concurrency, like Oracle.
MySQL (I am told) can be optimized to do faster reads than PostgreSQL, but both are pretty ridiculously fast in terms of # transactions/sec they support, and it doesn't sound like that's your problem.
P.S.
We were having a little discussion in the comments to a different answer -- do note that some of the biggest, storage-wise, databases in the world are implemented using Postgres (though they tend to tweak the internals of the engine). Postgres scales for data size extremely well, for concurrency better than most, and is very flexible in terms of what you can do with it.
I wish there were a better answer for you. Thirty years after the technology was invented, we should be able to let users get by with less detailed knowledge of the system and still have it run smoothly. But alas, extensive thinking and tweaking is required for every product I am aware of. I wonder if the creators of Stack Overflow could share how they handled DB concurrency and scalability? They are using SQL Server, I know that much.
P.P.S.
So, as chance would have it, I slammed head-first into a concurrency problem in Oracle yesterday. I am not totally sure I have it right, not being a DBA, but what the guys explained was something like this: we had a large number of processes connecting to the DB and examining the system dictionary, which apparently forces a short lock on it, despite the fact that it's just a read. Parsing queries does the same thing, so we had (on a multi-terabyte system with thousands of objects) a lot of forced wait times because processes were locking each other out of the system dictionary. Our system dictionary was also excessively big because it contains a separate copy of all the information for each partition, of which there can be thousands per table. This is not really related to PostgreSQL, but the takeaway is: in addition to checking your design, make sure your queries use bind variables and get reused, and that pressure on shared resources is kept to a minimum.
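A small sketch of the bind-variable point, using psycopg2 since the question is about Postgres; the connection string, table and column names are made up:

    import psycopg2  # pip install psycopg2-binary (connection details are assumptions)

    conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")
    cur = conn.cursor()

    # Bad: building a new literal SQL string per call means every statement
    # looks different to the server (and invites SQL injection):
    #   cur.execute("INSERT INTO users (name) VALUES ('" + name + "')")

    # Good: bind parameters keep the statement text identical across calls,
    # which is what lets engines that cache parsed statements (Oracle's
    # shared pool in the anecdote above) avoid re-parsing every time.
    cur.execute("INSERT INTO users (name) VALUES (%s)", ("alice",))

    # executemany reuses the same parameterized statement for a whole batch.
    cur.executemany("INSERT INTO users (name) VALUES (%s)", [("bob",), ("carol",)])
    conn.commit()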
Please change the OS under which you run Postgres - the Windows port, though immensely useful for expanding the user base, is still not on a par with the (much older and more mature) Un*x ports (and especially the Linux one).
I think your best choice is still PostgreSQL. Spend the time to make sure you have properly tuned your application. Once you're confident you have reached the limits of what can be done with tuning, start caching everything you can. After that, start thinking about moving to an asynchronous master-slave setup... Also, are you running OLAP-type functionality on the same database you're doing OLTP on?
Let me introduce you to the simplest, most practical way to scale almost any database server if the database design is truly optimal: just double your ram for an instantaneous boost in performance. It's like magic.
PostgreSQL scales better than most; if you are going to stay with a relational DB, Oracle would be the alternative. ODBMSs scale better still, but they have their own issues, in that setting one up is closer to programming.
Yahoo uses PostgreSQL; that should tell you something about its scalability.
As highlighted above, the problem is probably not with the particular database you are using (i.e., PostgreSQL) but with one of the following:
Schema design: maybe you need to add, remove, or refine your indexes
Hardware: maybe you are asking too much of your server. You said 5k users, but then again very few of them are probably querying the DB at the same time
Queries: perhaps poorly written, resulting in a lot of inefficiency
A pragmatic way to find out what is happening is to analyse the PostgreSQL log files and find out which queries are the:
Most frequently executed
Longest running
etc. etc.
A quick review will tell you where to focus your efforts and you will most likely resolve your issues fairly quickly. There is no silver bullet, you have to do some homework but this will be small compared to changing your db vendor.
Good news... there are lots of utilities for analysing your log files that are easy to use and produce easy-to-interpret results. Here are two:
pgFouine - a PostgreSQL log analyzer (PHP)
pgFouine: Sample report
PQA (ruby)
PQA: Sample report
First, I would make sure the optimizations are, indeed, useful. For example, if you have many indexes, sometimes adding or modifying a record can take a long time.
I know there are several big projects running on PostgreSQL, so take a look at this issue.
I'd suggest looking here for information on PostgreSQL's performance: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win
What version of PG are you running? As the releases have progressed, performance has improved greatly.
Hi, I had the same issue previously at my current company. When I first joined, their queries were huge and very slow; they took 10 minutes to run. I was able to optimize them down to a few milliseconds, or 1 to 2 seconds. I learned many things during that time, and I will share a few highlights here.
Check your query first. Doing an inner join of all the tables you need will always take some time. One thing I would suggest is to always start with the table that lets you cut your data down to just the rows you need.
e.g. SELECT * FROM (SELECT * FROM person WHERE person ILIKE '%abc') AS person;
If you look at the example above, the inner query cuts your results down to what you know you need, and you can refine them further by joining against it. This is one of the best ways to speed up a query, but there is more than one way to skin a cat. I cannot explain all of them here because there are just too many, but starting from the example above, you just need to modify it to suit your needs.
It also depends on your Postgres version: newer versions optimize some queries better internally. One example is that IN statements on Postgres 8.2 and below are slower than on 8.4.
EXPLAIN ANALYZE is your friend. If your query is running slowly, run EXPLAIN ANALYZE on it to determine which part is causing the slowness (a short sketch follows these tips).
Vacuum your database. This ensures that the statistics on your database closely match the actual data; a big difference between the statistics and the actual data will result in your query running slowly.
If none of these helps, try modifying your postgresql.conf. Increase the shared memory and experiment with the configuration to better suit your needs.
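A minimal EXPLAIN ANALYZE sketch via psycopg2, reusing the person/ILIKE query from the example above; the connection details are made up, and note that EXPLAIN ANALYZE actually executes the query:

    import psycopg2  # pip install psycopg2-binary (connection details are assumptions)

    conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")
    cur = conn.cursor()

    # EXPLAIN ANALYZE runs the query and reports actual time per plan node,
    # which shows exactly where the slowness comes from.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM person WHERE person ILIKE %s", ("%abc",))
    for (line,) in cur.fetchall():
        print(line)

    # Things to look for: sequential scans on large tables, row-count
    # estimates far from the actual counts (a hint to VACUUM/ANALYZE), and
    # plan nodes whose actual time dominates the total.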
Hope this helps, but of course, these are just for postgres optimization.
BTW, 5000 users is not much. My DBs contain about 200k to a million users.
If you do want to switch away from PostgreSQL, Sybase SQL Anywhere is number 5 in terms of price/performance on the TPC-C benchmark list. It's also the lowest price option (by far) on the top 10 list, and is the only non-Microsoft and non-Oracle entry.
It can easily scale to thousands of users and terabytes of data.
Full disclosure: I work on the SQL Anywhere development team.
We need more details: What version are you using? What is the memory usage of the server? Are you vacuuming the database? Your performance problems might be unrelated to PostgreSQL.
If you had many more reads than writes, you might want to try MySQL (assuming the problem really is with Postgres), but your problem is a write problem.
Still, you may want to look into your database design, and possibly consider sharding. For a really large database you may still have to look at the above 2 issues regardless.
You may also want to look at non-RDBMS or document-oriented database servers, like Mnesia or CouchDB, depending on the task at hand. No single tool will manage all tasks, so choose wisely.
Just out of curiosity, do you have any stored procedures that may be causing this delay?