Has anyone had any experience with MonetDB? Currently, I have a MySQL database that is growing too large, and queries are getting too slow. According to column-oriented paradigm, insertions will be slower (which I don't mind at all), but data retrieval becomes very fast. Do I stand a chance of getting more data retrieval performance just by switching to MonetDB? Is it MonetDB mature enough?
You have a chance of improving the performance of your application. The gain is, however, largely dependent on your workload, the size of your database and your hardware. MonetDB is developed/tuned under two main assumptions:
Your workload is analytical, i.e., you have lots of (grouped) aggregations and the like.
Even more important: your hot dataset (the data that you actually work with) fits into the main memory of your system. MonetDB does not have it's own Buffer Manager but relies on the OS to handle disk I/O. Since the OS (especially windows but Linux too) is sometimes very dumb about disk swapping that may become a problem (especially for joins that run out of memory).
As for the maturity, there are probably more opinions on that than people inhabiting this planet. Personally, I find it mature enough but I am a member of the development team and, thus, biased. But MonetDB is a research project so if you have an interesting application we'd love to hear about it and see if we can help.
The answer of course depends on your payload but my experience so far would seem to indicate that about everything is faster in MonetDB than I've seen in MySQL. The exception would be joins, which not only seem slow, but seem completely inept at pipelining so you end up needing gobs of memory to process large ones. That said my experience with joins in MySQL hasn't exactly been stellar either, so I'm guessing your expectations may be low. If you really want good join performance, I'd probably recommend SQL Server or the like; for those other queries you mention in the follow up comments, MonetDB should be awesome.
For instance, given a table with about 2 million rows in it, I was able to range on one column (wherin there were about 800K rows in the range) and order by another column and the limited result was processed and returned in 25ms. Performance of those type of queries does seem to degrade with scale, but that should give you a taste for what you might expect at that scale.
I should caution that the optimistic concurrency model might throw off those that have only been exposed to pessimistic concurrency (most people). I'd research it before wondering why some of your commits fail under concurrent load.
Related
I have a single machine for setting up a database. The main purpose of that database is to write/write audit log. The write/read ratio is about 9:1. Originally I planed to use MySQL. Later I heard of NoSQL DB have better performance on when "scale out". I am interested if I only have 1 single machine, will NoSQL DB (eg, cassandra or mongo DB) have better performance than MySQL?
Yes, it does.
I am using MongoDB. I tried my application on both MySQL and MongoDB. I see better performance on MongoDB.
So, in general I have noticed better performance.
But I think it depends on certain factors such as if you are using join, foreign keys etc. As you must be knowing in NoSQL there are usually no joins etc.
PS: Having said all this. It depends on how and what you are doing.
Also, if you are using MongoDB as your NoSQL Database, do check the newer version (3.0) as they say it is 10 times faster than their older version (2.6).
And if you really need a high performance database you can go for Redis.
It's a "NoSQL" key-value data store. More precisely, it is a data structure server. Not like MongoDB (which is a disk-based document store), though MongoDB could be used for similar key/value use cases. The closest analog is probably to think of Redis as Memcached, but with built-in persistence (snapshotting or journaling to disk) and more datatypes.
You will need to read a bit about it to make a decision on what is important to you.
It really depends on your usage scenario. In general, NoSQL solutions tend to provide simpler semantics which, in theory at least, can avoid some of the overhead of ACID-compliant SQL solutions. In practice though the real world performance can be affected by so many factors that the theory rarely comes into play.
On a single box, you are probably better off spending time on optimizing your code to work with whatever platform you choose.
There are "textbook" solutions for dealing with log data - ElasticSearch/Kibana/LogStash is a common one, Redis is another - are usually focused on larger scale systems and don't fit your single-machine requirement.
Depending on how much you are willing to invest, you may be able to squeeze the most IOPS out of your box when using a really thin storage mechanism such as ESENT or BerkeleyDB. Word of warning though: those are really more painful to work with.
In summary, I'll repeat myself: On a single box, you are probably better off spending time on optimizing your code to work with whatever platform you choose.
One of the in-memory NoSQL databases with persistence is the best choice for your scenario. They do write operations extremely fast because a write operation under the hood there is just a write to the end of the write ahead log file, which can be done at speed up to 100mb/sec even on magnetic disks.
Try Redis or Tarantool. The latter is better if you have a heavy concurrent write load.
Today I found an article online discussing Facebooks architecture (though it's a bit dated). While reading it I noticed under the section Software that helps Facebook scale, the third bullet point states:
Facebook uses MySQL, but primarily as a key-value persistent storage,
moving joins and logic onto the web servers since optimizations are
easier to perform there (on the “other side” of the Memcached layer).
Why move complex joins to the web server? Aren't databases optimized to perform join logic? This methodology seems contrary to what I've learned up to this point, so maybe the explanation is just eluding me.
If possible, could someone explain this (an example would help tremendously) or point me to a good article (or two) for the benefits (and possibly examples) of how and why you'd want to do this?
I'm not sure about Facebook, but we have several applications where we follow a similar model. The basis is fairly straightforward.
The database contains huge amounts of data. Performing joins at the database level really slows down any queries we make on the data, even if we're only returning a small subset. (Say 100 rows of parent data, and 1000 rows of child data in a parent-child relationship for example)
However, using .NET DataSet objects, of we select in the rows we need and then create DataRelation objects within the DataSet, we see a dramatic boost in performance.
I can't answer why this is, as I'm not knowledgeable about the internal workings of either, but I can venture a guess...
The RDBMS (Sql Server in our case) has to deal with the data that lives in files. These files are very large, and only so much of it can be loaded into memory, even on our heavy-hitter SQL Servers, so it there is a penalty of disk I/O.
When we load a small portion of it into a Dataset, the join is happening entirely in memory, so we lose the I/O penalty of going to the disk.
Even though I can't explain the reason for the performance boost completely (and I'd love to have someone more knowledgeable tell me if my guess is right) I can tell you that in certain cases, when there is a VERY large amount of data, but your app only needs to pull a small subset of it, there is a noticeable boot in performance by following the model described. We've seen it turn apps that just crawl into lightning-quick apps.
But if done improperly, there is a penalty - if you overload the machine's RAM but doing it inappropriately or in every situation, then you'll have crashes or performance issues as well.
I have a question, just looking for suggestions here.
So, my application is 'modernizing' a desktop application by converting it to the web, with an ICEFaces UI and server side written in Java. However, they are keeping around the same Oracle database, which at current count has about 700-900 tables and probably a billion total records in the tables. Some individual tables have 250 million rows, many have over 25 million.
Needless to say, the database is not scaling well. As a result, the performance of the application is looking to be abysmal. The architects / decision makers-that-be have all either refused or are unwilling to restructure the persistence. So, basically we are putting a fresh coat of paint on a functional desktop application that currently serves most user needs and does so with relative ease. The actual database performance is pretty slow in the desktop app now. The quick performance I referred to earlier was non-database related stuff (sorry I misspoke there). I am having trouble sleeping at night thinking of how poorly this application is going to perform and how difficult it is going to be for everyday users to do their job.
So, my question is, what options do I have to mitigate this impending disaster? Is there some type of intermediate layer I can put in between the database and the Java code to speed up performance while at the same time keeping the database structure intact? Caching is obviously an option, but I don't see that as being a cure-all. Is it possible to layer a NoSQL DB in between or something?
I don't understand how to reconcile two things you said.
Needless to say, the database is not scaling well
and
currently serves most user needs and does so with relative ease and quick performance.
You don't say you are adding new users or new function, just making the same function accessible via a web interface.
So why is there a problem. Your Web App will be doing more or less the same database work as before.
In fact introducing a web tier could well give new caching opportunities so reducing the work the DB is doing.
If your early pieces of web app development are showing poor performance then I would start by trying to understand how the queries you are doing in the web app differ from those done by the existing app. Is it possible that you are using some tooling which is taking a somewhat naive approach to generating queries?
If the current app performs well and your new java app doesn't, the problem is not in the database layer, but in your application layer. If performance is as bad as you say, they should notice fairly early and have the option of going back to the Desktop application.
The DBA should be able to readily identify the additional workload on the database from your application. Assuming the logic hasn't changed it is unlikely to be doing more writes. It could be reads or it could be 'chattier' (moving the same amount of information but in smaller parcels). Chatty applications can use a lot of CPU. A lot of architects try to move processing from the database layer into the application layer because "work on the database is expensive" but actually make things worse due to the overhead of the "to-and-fro".
PS.
There's nothing 'bad' about having 250 million rows in a table. Generally you access a table through an index. There are typically 2 or 3 hops from the top of an index to the bottom (and then one more to the table). I've got a 20 million row table with a BLEVEL of 2 and a 120+ million row table with a BLEVEL of 3.
Indexing means that you rarely hit more than a small proportion of your data blocks. The frequently used index blocks (and data blocks) get cached in the database server's memory. The DBA would be able to see if this memory area is too small for the workload (ie a lot of physical disk IO).
If your app is getting a lot of information that it doesn't really need, this can put pressure on the memory space. Don't be greedy. if you only need three columns from a row, don't grab the whole row.
What you describe is something that Oracle should be capable of handling very easily if you have the right equipment and database design. It should scale well if you get someone on your team who is a specialist in performance tuning large applications.
Redoing the database from scratch would cost a fortune and would introduce new bugs and the potential for loss of critical information is huge. It almost never is a better idea to rewrite the database at this point. Usually those kinds of projects fail miserably after costing the company thousands or even millions of dollars. Your architects made the right choice. Learn to accept that what you want isn't always the best way. The data is far more important to the company than the app. There are many reasons why people have learned not to try to redesign the database from scratch.
Now there are ways to improve database performance. First thing I would consider with a database this size is partioning the data. I would also consider archiving old data to a data warehouse and doing most reporting from that. Other things to consider would be improving your servers to higher performing models, profiling to find slowest running queries and individually fixing them, looking at indexing, updating statistics and indexes (not sure if this is what you do on Oracle, I'm a SLQ Server gal but your dbas would know). There are some good books on refactoring old legacy databases. The one below is not datbase specific.
http://www.amazon.com/Refactoring-Databases-Evolutionary-Database-Design/dp/0321293533/ref=sr_1_1?ie=UTF8&s=books&qid=1275577997&sr=8-1
There are also some good books on performance tuning (look for ones specific to Oracle, what works for SQL Server or mySQL is not what is best for Oracle)
Personally I would get those and read them from cover to cover before designing a plan for how you are going to fix the poor performance. I would also include the DBAs in all your planning, they know things that you do not about the database and why some things are designed the way they are.
If you have a lot of lookups that are for items not in the database you can reduce the number by using a bloom filter. Add everything in the database to the bloom filter then before you do a lookup check the bloom first. Only if the bloom reports it present do you need to bother the database. The bloom will result in false positives but you can design it to the 'size vs false positive' trade off that best suits you.
The strategy is used by Google in their big-table database and they have reported that it significantly improves performance.
http://en.wikipedia.org/wiki/Bloom_filter
Good luck, working on tasks you don't believe in is tough.
So you put a fresh coat of paint on a functional and quick desktop application and then the system becomes slow?
And then you say that "it is needless to say that the database isn't scaling well"?
I don't get it. I think that there is something wrong with your fresh coat of paint, not with the database.
Don't be put down by this sort of thing. See it as a challenge, rather than something to be losing sleep over! I know it's tempting as a programmer to want to rip everything out and start over again, but from a business perspective, it's just not always viable. For example, by using the same database, the business can continue to use the old application while the new one is being developed and switch over customers in groups, rather than having to switch everyone over at the same time.
As for what you can do about performance, it depends a lot on the usage pattern. Caching can help greatly with mostly read-only databases. Even with read/write database, it can still be a boon if correctly designed. A NoSQL database might help with write-heavy stuff, but it might also be more trouble than it's worth if the data has to end up in a regular database anyway.
In the end, it all depends greatly on your application's architecture and usage patterns.
Good luck!
Well without knowing too much about what kinds of queries that are mostly done (I would expact lookups to be more common) perhaps you should try caching first. And cache at different layers, at the layer before the app server if possible and of course what you suggested caching at the layer between the app server and the database.
Caching works well for read data and it might not be as bad as you think.
Have you looked at Terracotta ? They do have some caching and scaling stuff that might be relavant to you.
Take it as a challenge!
The way to 'mitigate this impending disaster' is to do what you should be doing anyway. If you follow best practices the pain of switching out your persistence layer at a later stage will be minimal.
Up until the time that you have valid performance benchmarks and identified bottlenecks in the system talk of performance is premature. In any case I would be surprised if many of the 'intermediate layer' strategies aren't already implemented at the database level.
If the database is legacy and enormous, then
1) it cannot be changed in a way that will change the interface, as this will break too many existing applications. Or, if you change the interface, this has to be coordinated with modifying multiple applications with associated testing.
2) If the issue is performance, then there are probably many changes that can be made to optimize the database without changing the interface.
3) Views can be used to maintain the existing interfaces while restructuring tables for more efficiency, or possibly to allow more efficient access in the future.
4) Standard database optimizations, such as performance analysis, indexing, caching can probably greatly increase efficiency and performance without changing the interface.
There's a lot more that can be done, but you get the idea. It can't really be updated in one single big change. Changes have to be incremental, or transparent to the applications that use it.
The database is PART of the application. Don't consider them to be separate, it isn't.
As developer, you need to be free to make schema changes as necessary, and suggest data changes to improve performance / functionality in production (for example archiving old data).
Your development system presumably does not have that much data, but has the exact same schema.
In order to do performance testing, you will need a system with the same hardware and same size data (same data if possible) as production. You should explain to management that performance testing is absolutely necessary as you feel the app isn't going to perform.
Of course making schema changes (adding / removing indexes, splitting tables out etc) may affect other parts of the system - which you should consider as parts of a SYSTEM - and hence do the necessary regression testing and fixing.
If you need to modify the database schema, and make changes to the desktop client accordingly, to make the web app perform, that is what you have to do - justify your design decision to the management.
I am currently into a performance tuning exercise. The application is DB intensive with very little processing logic. The performance tuning is around the way DB calls are made and the DB itself.
We did the query tuning, We put the missing indexes, We reduced or eliminated DB calls where possible. The application is performing very well and all is fine.
With smaller data volume (say upto 100,000 records), the performance is fantastic. My Question is, what needs to be done to ensure such good performance at higher data volumes ?
The data volumes are expected to reach 10 million records.
I can think of table and index partitioning, suggesting filesystems optimized for DB storage and periodic archiving to keep the number of rows in check. I would like to know what else could be done. Any tips/strategies/patterns would be very helpful.
Monitoring. Use some tools to monitor performance, and saturation of CPU, memory, and I/O. Make trend lines so you know where your next bottleneck will be before you get there.
Testing. Create mock data so you have 10 million rows on a testing server today. Benchmark the queries you have in your application and see how well they perform as the volume of data increases. You might be surprised at what breaks down first, or it may go exactly as predicted. The point is that you can find out.
Maintenance. Make sure your application and infrastructure support some downtime, because that's always necessary. You might have to defrag and rebuild your indexes. You might have to refactor some of the table structure. You might have to upgrade the server software or apply patches. To do this without interrupting continuous operation, you'll need some redundancy built in to the design.
Research. Find the best journals and blogs for the database brand you're using, and read them (e.g. http://www.mysqlperformanceblog.com if you use MySQL). You can ask good questions like the one you ask here, but also read what other people are asking, and what they're being advised to do about it. You can learn solutions to problems that you don't even have yet, so that when you have them, you'll have some strategies to employ.
Different databases need to be tuned in different ways. What RDBMS are you using?
Also, how do you know whether or not what you have done so far will result in poor performance with larger data sets? Have you tested your current optimisations with a large amount of test data?
When you did this, how did the performance change? If you are able to tune the database so that it performs with the data it has now, there's no reason to think that your methods won't work with a larger data set.
Depending on the RDBMS, the next type of solution is simple: get bigger, beefier hardware. More RAM, more disks, more CPUs.
You are on the right way:
1) Proper indexes
2) DBMS options tuning (memory caches, buffers, internal threads control and so on)
3) Queries tuning (especially log slow queries and then tune/rewrite them)
4) To tune your queries and indexes you may need to research your queries execution plans
5) Poweful dedicated server
6) Think about queries which your client applications send to the database. Are they always necessary? Do you need all the data you ask for? Is it possible to cache some data?
10 million records is probably too small to bother with partitioning. Typically partitioning will only be interesting if your data volumes are an order or magnitude or so bigger than that.
Index tuning for a database with 100,000 rows will probably get you 99% of what you need with 10 million rows. Keep an eye out for table scans or index range scans on the large tables in the system. On smaller tables they are fine and in some cases even optimal.
Archiving old data may help but this is probably overkill for 10 million rows.
One possible optimisation is to move reporting off onto a separate server. This will reduce the burden on the server - reports are often quite anti-social when run on operational systems as the schema tends not to be well optimised for it.
You can use database replication to do this or make a data mart for reporting. Replication is easier to implement but the reports will be less efficient, no more efficient than they were on the production system. Building a star schema data mart will be more efficient for reporting but incur additional development work.
In a web app that support more than 5000 users, postgres is becoming the bottle neck.
It takes more than 1 minute to add a new user.(even after optimizations and on Win 2k3)
So, as a design issue, which other DB's might be better?
Most likely, it's not PostgreSQL, it's your design. Changing shoes most likely will not make you a better dancer.
Do you know what is causing slowness? Is it contention, time to update indexes, seek times?
Are all 5000 users trying to write to the user table at the same exact time as you are trying to insert 5001st user? That, I can believe can cause a problem. You might have to go with something tuned to handling extreme concurrency, like Oracle.
MySQL (I am told) can be optimized to do faster reads than PostgreSQL, but both are pretty ridiculously fast in terms of # transactions/sec they support, and it doesn't sound like that's your problem.
P.S.
We were having a little discussion in the comments to a different answer -- do note that some of the biggest, storage-wise, databases in the world are implemented using Postgres (though they tend to tweak the internals of the engine). Postgres scales for data size extremely well, for concurrency better than most, and is very flexible in terms of what you can do with it.
I wish there was a better answer for you, 30 years after the technology was invented, we should be able to make users have less detailed knowledge of the system in order to have it run smoothly. But alas, extensive thinking and tweaking is required for all products I am aware of. I wonder if the creators of StackOverflow could share how they handled db concurrency and scalability? They are using SQLServer, I know that much.
P.P.S.
So as chance would have it I slammed head-first into a concurrency problem in Oracle yesterday. I am not totally sure I have it right, not being a DBA, but what the guys explained was something like this: We had a large number of processes connecting to the DB and examining the system dictionary, which apparently forces a short lock on it, despite the fact that it's just a read. Parsing queries does the same thing.. so we had (on a multi-tera system with 1000s of objects) a lot of forced wait times because processes were locking each other out of the system. Our system dictionary was also excessively big because it contains a separate copy of all the information for each partition, of which there can be thousands per table. This is not really related to PostgreSQL, but the takeaway is -- in addition to checking your design, make sure your queries are using bind variables and getting reused, and pressure is minimal on shared resources.
Please change the OS under which you run Postgres - the Windows port, though immensely useful for expanding the user base, is still not on a par with the (much older and more mature) Un*x ports (and especially the Linux one).
Ithink your best choice is still PostgresSQL. Spend the time to make sure you have properly tuned your application. After your confident you have reached the limits of what can be done with tuning, start cacheing everything you can. After that, start think about moving to an asynchronous master slave setup...Also are you running OLAP type functionality on the same database your doing OLTP on?
Let me introduce you to the simplest, most practical way to scale almost any database server if the database design is truly optimal: just double your ram for an instantaneous boost in performance. It's like magic.
PostgreSQL scales better than most, if you are going to stay with a relational db, Oracle would be it. ODBMS scale better but they have their own issues, as in that it is closer to programming to set one up.
Yahoo uses PostgreSQL, that should tell you something about is scalability.
As highlighted above the problem is not with the particular database you are using, i.e. PostgreSQL but one of the following:
Schema design, maybe you need to add, remove, refine your indexes
Hardware maybe you are asking to much of your server - you said 5k users but then again very few of them are probably querying the db at the same time
Queries: perhaps poorly defined resulting in lots of inefficiency
A pragmatic way to find out what is happening is to analyse the PostgeSQL log files and find out what queries in terms of:
Most frequently executed
Longest running
etc. etc.
A quick review will tell you where to focus your efforts and you will most likely resolve your issues fairly quickly. There is no silver bullet, you have to do some homework but this will be small compared to changing your db vendor.
Good news ... there are lots of utilities to analayse your log files that are easy to use and produce easy to interpret results, here are two:
pgFouine - a PostgreSQL log analyzer (PHP)
pgFouine: Sample report
PQA (ruby)
PQA: Sample report
First, I would make sure the optimizations are, indeed, useful. For example, if you have many indexes, sometimes adding or modifying a record can take a long time.
I know there are several big projects running over PostgreSQL, so take a look at this issue.
I'd suggest looking here for information on PostgreSQL's performance: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win
What version of PG are you running? As the releases have progressed, performance has improved greatly.
Hi had the same issue previously with my current company. When I first joined them, their queries were huge and very slow. It takes 10 minutes to run them. I was able to optimize them to a few milliseconds or 1 to 2 seconds. I have learned many things during that time and I will share a few highlights in them.
Check your query first. doing an inner join of all the tables that you need will always take sometime. One thing that I would suggest is always start off with the table with which you can actually cut your data to those that you need.
e.g. SELECT * FROM (SELECT * FROM person WHERE person ilike '%abc') AS person;
If you look at the example above, this will cut your results to something that you know you need and you can refine them more by doing an inner join. This is one of the best way to speed up your query but there are more than one way to skin a cat. I cannot explain all of them here because there are just too many but from the example above, you just need to modify that to suite your need.
It depends on your postgres version. Older postgres does internally optimize the query. on example is that on postgres 8.2 and below, IN statements are slower than 8.4's.
EXPLAIN ANALYZE is your friend. if your query is running slow, do an explain analyze to determine which of it is causing the slowness.
Vacuum your database. This will ensure that statistics on your database will almost match the actual result. Big difference in the statistics and actual will result on your query running slow.
If all of these does not help you, try modifying your postgresql.conf. Increase the shared memory and try to experiment with the configuration to better suite your needs.
Hope this helps, but of course, these are just for postgres optimization.
btw. 5000 users are not much. My DB contains users with about 200k to a million users.
If you do want to switch away from PostgreSQL, Sybase SQL Anywhere is number 5 in terms of price/performance on the TPC-C benchmark list. It's also the lowest price option (by far) on the top 10 list, and is the only non-Microsoft and non-Oracle entry.
It can easily scale to thousands of users and terabytes of data.
Full disclosure: I work on the SQL Anywhere development team.
We need more details: What version you are using? What is the memory usage of the server? Are you vacuuming the database? Your performance problems might be un-related to PostgreSQL.
If you have many reads over writes, you may want to try MySQL assuming that the problem is with Postgres, but your problem is a write problem.
Still, you may want to look into your database design, and possibly consider sharding. For a really large database you may still have to look at the above 2 issues regardless.
You may also want to look at non-RDBMS database servers or document oriented like Mensia and CouchDB depending on the task at hand. No single tool will manage all tasks, so choose wisely.
Just out of curiosity, do you have any stored procedures that may be causing this delay?