Best Database for remote sensor data logging - database

I need to choose a Database for storing data remotely from a big number (thousands to tens of thousands) of sensors that would generate around one entry per minute each.
The said data needs to be queried in a variety of ways from counting data with certain characteristics for statistics to simple outputting for plotting.
I am looking around for the right tool, I started with MySQL but I feel like it lacks the scalability needed for this project, and this lead me to noSQL databases which I don't know much about.
Which Database, either relational or not would be a good choice?
Thanks.

There is usually no "best" database since they all involve trade-offs of one kind or another. Your question is also very vague because you don't say anything about your performance needs other than the number of inserts per minute (how much data per insert?) and that you need "scalability".
It also looks like a case of premature optimization because you say you "feel like [MySQL] lacks the scalability needed for this project", but it doesn't sound like you've run any tests to confirm whether this is a real problem. It's always better to get real data rather than base an important architectural decision on "feelings".
Here's a suggestion:
Write a simple test program that inserts 10,000 rows of sample data per minute
Run the program for a decent length of time (a few days or more) to generate a sizable chunk of test data
Run your queries to see if they meet your performance needs (which you haven't specified -- how fast do they need to be? how often will they run? how complex are they?)
You're testing at least two things here: whether your database can handle 10,000 inserts per minute and whether your queries will run quickly enough once you have a huge amount of data. With large datasets these will become competing priorities since you need indexes for fast queries, but indexes will start to slow down your inserts over time. At some point you'll need to think about data archival as well (or purging, if historical data isn't needed) both for performance and for practical reasons (finite storage space).
These will be concerns no matter what database you select. From what little you've told us about your retrieval needs ("counting data with certain characteristics" and "simple outputting for plotting") it sounds like any type of database will do. It may be that other concerns are more important, such as ease of development (what languages and tools are you using?), deployment, management, code maintainability, etc.
Since this is sensor data we're talking about, you may also want to look at a round robin database (RRD) such as RRDTool to see if that approach better serves your needs.

Found this question while googling for "database for sensor data"
One of very helpful search-results (along with this SO question) was this blog:
Actually I've started a similar project (http://reatha.de) but realized too late, that I'm using not the best technologies available. My approach was similar MySQL + PHP. Finally I realized that this is not scalable and stopped the project.
Additionally, a good starting point is looking at the list of data-bases in Heroku:
If they use one, then it should be not the worst one.
I hope this helps.

you can try to use Redis noSQL database

Related

NoSQL or RDBMS for audit data

I know that similar questions were asked in the subject, but I still haven't seen anyone that completely contained all my requests.
I would start by saying that I only have experience in RDBMS's so I'm sorry if I get anything regarding NoSQL wrong.
I'm creating a database that would hold a large amount of audit logs (about 1TB).
I'm using it for:
Fast data writing (a massive amount of audit logs is written all the time)
Search - search over the audit data (search actions performed by a certain user, at a certain time or a certain action... the database should support searching any of the 'columns' very quickly)
Analytics & Reporting - Generate daily, weekly, monthly reports of the data (They are predefined at the moment.. if they are more dynamic, does it affect the solution I should choose?)
Reliability (support for fail-over or any similar feature), Scalability (If I grow above 1TB to 2TB, 10TB or 100TB - does any of the solutions can't support this amount of data?) and of course Performance (in the use cases I specified) are very important to me.
I know RDBMS and that would be my easy way of starting, but I'm really concerned that after a while, the DB would simply not keep up with the pace.
My question is should I pick an RDBMS or NoSQL solution and why? If a NoSQL solution, since they are so different, which of them do you think fits my needs?
Generally there isn't a right or wrong answer here.
Fast data writing, either solution will be ok, although you didn't say what volume per second you are storing. Both solutions have things to watch out for.
Search (very quick) over all columns. For smaller volumes, say few hundred Gb, then either solution will be Ok (assuming skilled people put it together). You didn't actually say how fast/often you search, so if it is many times per minute this consideration becomes more important. Fast search can often slow down ability to write high volumes quickly as indexes required for search need to be updated.
Audit records typically have a time component, so searching that is time constrained, eg within last 7 days, will significantly speed up search times compared to search all records.
Reporting. When you get up to 100Tb, you are going to need some real tricks, or a big budget, to get fast reporting. For static reporting, you will probably end up creating one program that generates multiple reports at once to save I/O. Dynamic reports will be the tricky one.
My opinion? Since you know RDBMS, I would start with that as a method and ship the solution. This buys you time to learn the real problems you will encounter (the no premature optimization that many on SO are keen on). During this initial timeframe you can start to select nosql solutions and learn them. I am assuming here that you want to run your own hardware/database, if you want to use cloud type solutions, then go to them straight away.

What NoSQL database should I be using?

Ok, so I've been doing a bit of research into NoSQL databases, and they seem to be the right option for what I need. The problem is however, that a lot of these databases, if not most of them are reading to/writing from RAM, as opposed to disk. That's great when you have plenty of server resources or don't expect massive data blocks - but I think I should prepare for the worst.
What I expect to receive from these data sources is anywhere from 25KB to 150KB per query - yup - up to 150KB for a single key value. The average user will produce anywhere from 500 to 5000 of these keys and they can grow infinitely (but will probably stop somewhere in that 5000 range). If you quickly do the calculations (most of the data will be on the higher end of 25-150, so I'll use 100KB as an "average", most users will probably produce 2000-3000 queries): 100KB*3000 - that's 300MB per user! An insane amount of data when you start getting a decent userbase. So, ultimately I'll probably throw away most of the data in the queries so it is no more than 1KB or so, but that will still far surpass most RAM capabilities.
So I think what I'm looking for is a solution that will store data to disk, and cache objects in RAM.. But I'm open to all solutions! Let me know what you guys think. I would love to keep this thing running fast...
Edit:
Wording it slightly differently as to be useful to a passerby:
If one is looking to maximize performance but handle large dataloads in a NoSQL database, what would be the recommended NoSQL database? I would think it would be one which stores data to disk, but this can compromise performance significantly. Is there a "best of both worlds" solution out there? It is important to note I assume, that these records would not be modified once they were submitted, only read from (but maybe not even that often).
I've been looking into Redis for such a task, because it looks very clean to manage - however it runs entirely in RAM, thus requires small data blocks, or multiple servers running multiple instances at once.. Which is something I don't have access to.
First of all, I think when you say most you've seen store data in RAM, you refer to in memory Key/Value data stores like Redis or Memcached.
But there's more than that. Before closing the discussion on in-memory NoSQL options, I should say that you are right. Memory fills up quite easily and you would need tons of it, judging from your requirements. So in-memory options should be discarded (not they're not useful, but not not in this specific situation).
My proposal is MongoDb. Does what you need: stores data on disk, caches stuff in-memory (as much as it can).
However, you need some powerful data storage options (SSD is what you should think about) so it can handle your data throughput needs. I've tested Mongo, but with far less data.
I was looking for over 1 million elements collections, with value sizes ranging from 5Kb to 50Kb.
I was mostly interested in read speeds. I should also mention write speeds, which I tested, and must say that they are impressive. One million 20Kb inserts in a few minutes (on a small server - quad core, 8GB of RAM, VMware VM).
Getting back to read speeds, I was looking for semi-concurrent queries that would give me under 50msec read times for around 100 concurrent users.
With some help from the MongoDb team I managed to get close to those times, but then I got into something else and had to drop my research (temporarily, I hope to resume it soon). There are far more things to look into, as speeds for aggregates, map/reduce, etc.
I can say that query times on the server were super fast and all the overhead was added by BSON serialization/deserialization and transport over the network.
So, for you Mongo would be appropriate, but you have to back it up with some good hardware.
You should really install it and test it in your specific situation and draw your conclusions from your own tests.
If you're going to do it and your client is .NET, then you should use their official driver. Otherwise, there are plenty others listed here: http://www.mongodb.org/display/DOCS/Drivers.
A good intro on Mongo features and how to use them can be found here: http://www.mongodb.org/display/DOCS/Developer+Zone. Granted, their documentation is not as good as the one for RavenDb (another NOSQL solution I've tested, but not nearly as fast) but you can get good support here or on Google Groups.

How to modernize an enormous legacy database?

I have a question, just looking for suggestions here.
So, my application is 'modernizing' a desktop application by converting it to the web, with an ICEFaces UI and server side written in Java. However, they are keeping around the same Oracle database, which at current count has about 700-900 tables and probably a billion total records in the tables. Some individual tables have 250 million rows, many have over 25 million.
Needless to say, the database is not scaling well. As a result, the performance of the application is looking to be abysmal. The architects / decision makers-that-be have all either refused or are unwilling to restructure the persistence. So, basically we are putting a fresh coat of paint on a functional desktop application that currently serves most user needs and does so with relative ease. The actual database performance is pretty slow in the desktop app now. The quick performance I referred to earlier was non-database related stuff (sorry I misspoke there). I am having trouble sleeping at night thinking of how poorly this application is going to perform and how difficult it is going to be for everyday users to do their job.
So, my question is, what options do I have to mitigate this impending disaster? Is there some type of intermediate layer I can put in between the database and the Java code to speed up performance while at the same time keeping the database structure intact? Caching is obviously an option, but I don't see that as being a cure-all. Is it possible to layer a NoSQL DB in between or something?
I don't understand how to reconcile two things you said.
Needless to say, the database is not scaling well
and
currently serves most user needs and does so with relative ease and quick performance.
You don't say you are adding new users or new function, just making the same function accessible via a web interface.
So why is there a problem. Your Web App will be doing more or less the same database work as before.
In fact introducing a web tier could well give new caching opportunities so reducing the work the DB is doing.
If your early pieces of web app development are showing poor performance then I would start by trying to understand how the queries you are doing in the web app differ from those done by the existing app. Is it possible that you are using some tooling which is taking a somewhat naive approach to generating queries?
If the current app performs well and your new java app doesn't, the problem is not in the database layer, but in your application layer. If performance is as bad as you say, they should notice fairly early and have the option of going back to the Desktop application.
The DBA should be able to readily identify the additional workload on the database from your application. Assuming the logic hasn't changed it is unlikely to be doing more writes. It could be reads or it could be 'chattier' (moving the same amount of information but in smaller parcels). Chatty applications can use a lot of CPU. A lot of architects try to move processing from the database layer into the application layer because "work on the database is expensive" but actually make things worse due to the overhead of the "to-and-fro".
PS.
There's nothing 'bad' about having 250 million rows in a table. Generally you access a table through an index. There are typically 2 or 3 hops from the top of an index to the bottom (and then one more to the table). I've got a 20 million row table with a BLEVEL of 2 and a 120+ million row table with a BLEVEL of 3.
Indexing means that you rarely hit more than a small proportion of your data blocks. The frequently used index blocks (and data blocks) get cached in the database server's memory. The DBA would be able to see if this memory area is too small for the workload (ie a lot of physical disk IO).
If your app is getting a lot of information that it doesn't really need, this can put pressure on the memory space. Don't be greedy. if you only need three columns from a row, don't grab the whole row.
What you describe is something that Oracle should be capable of handling very easily if you have the right equipment and database design. It should scale well if you get someone on your team who is a specialist in performance tuning large applications.
Redoing the database from scratch would cost a fortune and would introduce new bugs and the potential for loss of critical information is huge. It almost never is a better idea to rewrite the database at this point. Usually those kinds of projects fail miserably after costing the company thousands or even millions of dollars. Your architects made the right choice. Learn to accept that what you want isn't always the best way. The data is far more important to the company than the app. There are many reasons why people have learned not to try to redesign the database from scratch.
Now there are ways to improve database performance. First thing I would consider with a database this size is partioning the data. I would also consider archiving old data to a data warehouse and doing most reporting from that. Other things to consider would be improving your servers to higher performing models, profiling to find slowest running queries and individually fixing them, looking at indexing, updating statistics and indexes (not sure if this is what you do on Oracle, I'm a SLQ Server gal but your dbas would know). There are some good books on refactoring old legacy databases. The one below is not datbase specific.
http://www.amazon.com/Refactoring-Databases-Evolutionary-Database-Design/dp/0321293533/ref=sr_1_1?ie=UTF8&s=books&qid=1275577997&sr=8-1
There are also some good books on performance tuning (look for ones specific to Oracle, what works for SQL Server or mySQL is not what is best for Oracle)
Personally I would get those and read them from cover to cover before designing a plan for how you are going to fix the poor performance. I would also include the DBAs in all your planning, they know things that you do not about the database and why some things are designed the way they are.
If you have a lot of lookups that are for items not in the database you can reduce the number by using a bloom filter. Add everything in the database to the bloom filter then before you do a lookup check the bloom first. Only if the bloom reports it present do you need to bother the database. The bloom will result in false positives but you can design it to the 'size vs false positive' trade off that best suits you.
The strategy is used by Google in their big-table database and they have reported that it significantly improves performance.
http://en.wikipedia.org/wiki/Bloom_filter
Good luck, working on tasks you don't believe in is tough.
So you put a fresh coat of paint on a functional and quick desktop application and then the system becomes slow?
And then you say that "it is needless to say that the database isn't scaling well"?
I don't get it. I think that there is something wrong with your fresh coat of paint, not with the database.
Don't be put down by this sort of thing. See it as a challenge, rather than something to be losing sleep over! I know it's tempting as a programmer to want to rip everything out and start over again, but from a business perspective, it's just not always viable. For example, by using the same database, the business can continue to use the old application while the new one is being developed and switch over customers in groups, rather than having to switch everyone over at the same time.
As for what you can do about performance, it depends a lot on the usage pattern. Caching can help greatly with mostly read-only databases. Even with read/write database, it can still be a boon if correctly designed. A NoSQL database might help with write-heavy stuff, but it might also be more trouble than it's worth if the data has to end up in a regular database anyway.
In the end, it all depends greatly on your application's architecture and usage patterns.
Good luck!
Well without knowing too much about what kinds of queries that are mostly done (I would expact lookups to be more common) perhaps you should try caching first. And cache at different layers, at the layer before the app server if possible and of course what you suggested caching at the layer between the app server and the database.
Caching works well for read data and it might not be as bad as you think.
Have you looked at Terracotta ? They do have some caching and scaling stuff that might be relavant to you.
Take it as a challenge!
The way to 'mitigate this impending disaster' is to do what you should be doing anyway. If you follow best practices the pain of switching out your persistence layer at a later stage will be minimal.
Up until the time that you have valid performance benchmarks and identified bottlenecks in the system talk of performance is premature. In any case I would be surprised if many of the 'intermediate layer' strategies aren't already implemented at the database level.
If the database is legacy and enormous, then
1) it cannot be changed in a way that will change the interface, as this will break too many existing applications. Or, if you change the interface, this has to be coordinated with modifying multiple applications with associated testing.
2) If the issue is performance, then there are probably many changes that can be made to optimize the database without changing the interface.
3) Views can be used to maintain the existing interfaces while restructuring tables for more efficiency, or possibly to allow more efficient access in the future.
4) Standard database optimizations, such as performance analysis, indexing, caching can probably greatly increase efficiency and performance without changing the interface.
There's a lot more that can be done, but you get the idea. It can't really be updated in one single big change. Changes have to be incremental, or transparent to the applications that use it.
The database is PART of the application. Don't consider them to be separate, it isn't.
As developer, you need to be free to make schema changes as necessary, and suggest data changes to improve performance / functionality in production (for example archiving old data).
Your development system presumably does not have that much data, but has the exact same schema.
In order to do performance testing, you will need a system with the same hardware and same size data (same data if possible) as production. You should explain to management that performance testing is absolutely necessary as you feel the app isn't going to perform.
Of course making schema changes (adding / removing indexes, splitting tables out etc) may affect other parts of the system - which you should consider as parts of a SYSTEM - and hence do the necessary regression testing and fixing.
If you need to modify the database schema, and make changes to the desktop client accordingly, to make the web app perform, that is what you have to do - justify your design decision to the management.

What should every developer know about databases? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Whether we like it or not, many if not most of us developers either regularly work with databases or may have to work with one someday. And considering the amount of misuse and abuse in the wild, and the volume of database-related questions that come up every day, it's fair to say that there are certain concepts that developers should know - even if they don't design or work with databases today.
What is one important concept that developers and other software professionals ought to know about databases?
The very first thing developers should know about databases is this: what are databases for? Not how do they work, nor how do you build one, nor even how do you write code to retrieve or update the data in a database. But what are they for?
Unfortunately, the answer to this one is a moving target. In the heydey of databases, the 1970s through the early 1990s, databases were for the sharing of data. If you were using a database, and you weren't sharing data you were either involved in an academic project or you were wasting resources, including yourself. Setting up a database and taming a DBMS were such monumental tasks that the payback, in terms of data exploited multiple times, had to be huge to match the investment.
Over the last 15 years, databases have come to be used for storing the persistent data associated with just one application. Building a database for MySQL, or Access, or SQL Server has become so routine that databases have become almost a routine part of an ordinary application. Sometimes, that initial limited mission gets pushed upward by mission creep, as the real value of the data becomes apparent. Unfortunately, databases that were designed with a single purpose in mind often fail dramatically when they begin to be pushed into a role that's enterprise wide and mission critical.
The second thing developers need to learn about databases is the whole data centric view of the world. The data centric world view is more different from the process centric world view than anything most developers have ever learned. Compared to this gap, the gap between structured programming and object oriented programming is relatively small.
The third thing developers need to learn, at least in an overview, is data modeling, including conceptual data modeling, logical data modeling, and physical data modeling.
Conceptual data modeling is really requirements analysis from a data centric point of view.
Logical data modeling is generally the application of a specific data model to the requirements discovered in conceptual data modeling. The relational model is used far more than any other specific model, and developers need to learn the relational model for sure. Designing a powerful and relevant relational model for a nontrivial requirement is not a trivial task. You can't build good SQL tables if you misunderstand the relational model.
Physical data modeling is generally DBMS specific, and doesn't need to be learned in much detail, unless the developer is also the database builder or the DBA. What developers do need to understand is the extent to which physical database design can be separated from logical database design, and the extent to which producing a high speed database can be accomplished just by tweaking the physical design.
The next thing developers need to learn is that while speed (performance) is important, other measures of design goodness are even more important, such as the ability to revise and extend the scope of the database down the road, or simplicity of programming.
Finally, anybody who messes with databases needs to understand that the value of data often outlasts the system that captured it.
Whew!
Good question. The following are some thoughts in no particular order:
Normalization, to at least the second normal form, is essential.
Referential integrity is also essential, with proper cascading delete and update considerations.
Good and proper use of check constraints. Let the database do as much work as possible.
Don't scatter business logic in both the database and middle tier code. Pick one or the other, preferably in middle tier code.
Decide on a consistent approach for primary keys and clustered keys.
Don't over index. Choose your indexes wisely.
Consistent table and column naming. Pick a standard and stick to it.
Limit the number of columns in the database that will accept null values.
Don't get carried away with triggers. They have their use but can complicate things in a hurry.
Be careful with UDFs. They are great but can cause performance problems when you're not aware how often they might get called in a query.
Get Celko's book on database design. The man is arrogant but knows his stuff.
First, developers need to understand that there is something to know about databases. They're not just magic devices where you put in the SQL and get out result sets, but rather very complicated pieces of software with their own logic and quirks.
Second, that there are different database setups for different purposes. You do not want a developer making historical reports off an on-line transactional database if there's a data warehouse available.
Third, developers need to understand basic SQL, including joins.
Past this, it depends on how closely the developers are involved. I've worked in jobs where I was developer and de facto DBA, where the DBAs were just down the aisle, and where the DBAs are off in their own area. (I dislike the third.) Assuming the developers are involved in database design:
They need to understand basic normalization, at least the first three normal forms. Anything beyond that, get a DBA. For those with any experience with US courtrooms (and random television shows count here), there's the mnemonic "Depend on the key, the whole key, and nothing but the key, so help you Codd."
They need to have a clue about indexes, by which I mean they should have some idea what indexes they need and how they're likely to affect performance. This means not having useless indices, but not being afraid to add them to assist queries. Anything further (like the balance) should be left for the DBA.
They need to understand the need for data integrity, and be able to point to where they're verifying the data and what they're doing if they find problems. This doesn't have to be in the database (where it will be difficult to issue a meaningful error message for the user), but has to be somewhere.
They should have the basic knowledge of how to get a plan, and how to read it in general (at least enough to tell whether the algorithms are efficient or not).
They should know vaguely what a trigger is, what a view is, and that it's possible to partition pieces of databases. They don't need any sort of details, but they need to know to ask the DBA about these things.
They should of course know not to meddle with production data, or production code, or anything like that, and they should know that all source code goes into a VCS.
I've doubtless forgotten something, but the average developer need not be a DBA, provided there is a real DBA at hand.
Basic Indexing
I'm always shocked to see a table or an entire database with no indexes, or arbitrary/useless indexes. Even if you're not designing the database and just have to write some queries, it's still vital to understand, at a minimum:
What's indexed in your database and what's not:
The difference between types of scans, how they're chosen, and how the way you write a query can influence that choice;
The concept of coverage (why you shouldn't just write SELECT *);
The difference between a clustered and non-clustered index;
Why more/bigger indexes are not necessarily better;
Why you should try to avoid wrapping filter columns in functions.
Designers should also be aware of common index anti-patterns, for example:
The Access anti-pattern (indexing every column, one by one)
The Catch-All anti-pattern (one massive index on all or most columns, apparently created under the mistaken impression that it would speed up every conceivable query involving any of those columns).
The quality of a database's indexing - and whether or not you take advantage of it with the queries you write - accounts for by far the most significant chunk of performance. 9 out of 10 questions posted on SO and other forums complaining about poor performance invariably turn out to be due to poor indexing or a non-sargable expression.
Normalization
It always depresses me to see somebody struggling to write an excessively complicated query that would have been completely straightforward with a normalized design ("Show me total sales per region.").
If you understand this at the outset and design accordingly, you'll save yourself a lot of pain later. It's easy to denormalize for performance after you've normalized; it's not so easy to normalize a database that wasn't designed that way from the start.
At the very least, you should know what 3NF is and how to get there. With most transactional databases, this is a very good balance between making queries easy to write and maintaining good performance.
How Indexes Work
It's probably not the most important, but for sure the most underestimated topic.
The problem with indexing is that SQL tutorials usually don't mention them at all and that all the toy examples work without any index.
Even more experienced developers can write fairly good (and complex) SQL without knowing more about indexes than "An index makes the query fast".
That's because SQL databases do a very good job working as black-box:
Tell me what you need (gimme SQL), I'll take care of it.
And that works perfectly to retrieve the correct results. The author of the SQL doesn't need to know what the system is doing behind the scenes--until everything becomes sooo slooooow.....
That's when indexing becomes a topic. But that's usually very late and somebody (some company?) is already suffering from a real problem.
That's why I believe indexing is the No. 1 topic not to forget when working with databases. Unfortunately, it is very easy to forget it.
Disclaimer
The arguments are borrowed from the preface of my free eBook "Use The Index, Luke". I am spending quite a lot of my time explaining how indexes work and how to use them properly.
I just want to point out an observation - that is that it seems that the majority of responses assume database is interchangeable with relational databases. There are also object databases, flat file databases. It is important to asses the needs of the of the software project at hand. From a programmer perspective the database decision can be delayed until later. Data modeling on the other hand can be achieved early on and lead to much success.
I think data modeling is a key component and is a relatively old concept yet it is one that has been forgotten by many in the software industry. Data modeling, especially conceptual modeling, can reveal the functional behavior of a system and can be relied on as a road map for development.
On the other hand, the type of database required can be determined based on many different factors to include environment, user volume, and available local hardware such as harddrive space.
Avoiding SQL injection and how to secure your database
Every developer should know that this is false: "Profiling a database operation is completely different from profiling code."
There is a clear Big-O in the traditional sense. When you do an EXPLAIN PLAN (or the equivalent) you're seeing the algorithm. Some algorithms involve nested loops and are O( n ^ 2 ). Other algorithms involve B-tree lookups and are O( n log n ).
This is very, very serious. It's central to understanding why indexes matter. It's central to understanding the speed-normalization-denormalization tradeoffs. It's central to understanding why a data warehouse uses a star-schema which is not normalized for transactional updates.
If you're unclear on the algorithm being used do the following. Stop. Explain the Query Execution plan. Adjust indexes accordingly.
Also, the corollary: More Indexes are Not Better.
Sometimes an index focused on one operation will slow other operations down. Depending on the ratio of the two operations, adding an index may have good effects, no overall impact, or be detrimental to overall performance.
I think every developer should understand that databases require a different paradigm.
When writing a query to get at your data, a set-based approach is needed. Many people with an interative background struggle with this. And yet, when they embrace it, they can achieve far better results, even though the solution may not be the one that first presented itself in their iterative-focussed minds.
Excellent question. Let's see, first no one should consider querying a datbase who does not thoroughly understand joins. That's like driving a car without knowing where the steering wheel and brakes are. You also need to know datatypes and how to choose the best one.
Another thing that developers should understand is that there are three things you should have in mind when designing a database:
Data integrity - if the data can't be relied on you essentially have no data - this means do not put required logic in the application as many other sources may touch the database. Constraints, foreign keys and sometimes triggers are necessary to data integrity. Don't fail to use them because you don't like them or don't want to be bothered to understand them.
Performance - it is very hard to refactor a poorly performing database and performance should be considered from the start. There are many ways to do the same query and some are known to be faster almost always, it is short-sighted not to learn and use these ways. Read some books on performance tuning before designing queries or database structures.
Security - this data is the life-blood of your company, it also frequently contains personal information that can be stolen. Learn to protect your data from SQL injection attacks and fraud and identity theft.
When querying a database, it is easy to get the wrong answer. Make sure you understand your data model thoroughly. Remember often actual decisions are made based on the data your query returns. When it is wrong, the wrong business decisions are made. You can kill a company from bad queries or loose a big customer. Data has meaning, developers often seem to forget that.
Data almost never goes away, think in terms of storing data over time instead of just how to get it in today. That database that worked fine when it had a hundred thousand records, may not be so nice in ten years. Applications rarely last as long as data. This is one reason why designing for performance is critical.
Your database will probaly need fields that the application doesn't need to see. Things like GUIDs for replication, date inserted fields. etc. You also may need to store history of changes and who made them when and be able to restore bad changes from this storehouse. Think about how you intend to do this before you come ask a web site how to fix the problem where you forgot to put a where clause on an update and updated the whole table.
Never develop in a newer version of a database than the production version. Never, never, never develop directly against a production database.
If you don't have a database administrator, make sure someone is making backups and knows how to restore them and has tested restoring them.
Database code is code, there is no excuse for not keeping it in source control just like the rest of your code.
Evolutionary Database Design. http://martinfowler.com/articles/evodb.html
These agile methodologies make database change process manageable, predictable and testable.
Developers should know, what it takes to refactor a production database in terms of version control, continious integration and automated testing.
Evolutionary Database Design process has administrative aspects, for example a column is to be dropped after some life time period in all databases of this codebase.
At least know, that Database Refactoring concept and methodologies exist.
http://www.agiledata.org/essays/databaseRefactoringCatalog.html
Classification and process description makes it possible to implement tooling for these refactorings too.
About the following comment to Walter M.'s answer:
"Very well written! And the historical perspective is great for people who weren't doing database work at that time (i.e. me)".
The historical perspective is in a certain sense absolutely crucial. "Those who forget history, are doomed to repeat it.". Cfr XML repeating the hierarchical mistakes of the past, graph databases repeating the network mistakes of the past, OO systems forcing the hierarchical model upon users while everybody with even just a tenth of a brain should know that the hierarchical model is not suitable for general-purpose representation of the real world, etcetera, etcetera.
As for the question itself:
Every database developer should know that "Relational" is not equal to "SQL". Then they would understand why they are being let down so abysmally by the DBMS vendors, and why they should be telling those same vendors to come up with better stuff (e.g. DBMS's that are truly relational) if they want to go on sucking hilarious amounts of money out of their customers for such crappy software).
And every database developer should know everything about the relational algebra. Then there would no longer be a single developer left who had to post these stupid "I don't know how to do my job and want someone else to do it for me" questions on Stack Overflow anymore.
From my experience with relational databases, every developer should know:
- The different data types:
Using the correct type for the correct job will make your DB design more robust, your queries faster and your life easier.
- Learn about 1xM and MxM:
This is the bread and butter for relational databases. You need to understand one-to-many and many-to-many relations and apply then when appropriate.
- "K.I.S.S." principle applies to the DB as well:
Simplicity always works best. Provided you have studied how DB work, you will avoid unnecessary complexity which will lead to maintenance and speed problems.
- Indices:
It's not enough if you know what they are. You need to understand when to used them and when not to.
also:
Boolean algebra is your friend
Images: Don't store them on the DB. Don't ask why.
Test DELETE with SELECT
I would like everyone, both DBAs and developer/designer/architects, to better understand how to properly model a business domain, and how to map/translate that business domain model into both a normalized database logical model, an optimized physical model, and an appropriate object oriented class model, each one of which is (can be) different, for various reasons, and understand when, why, and how they are (or should be) different from one another.
I would say strong basic SQL skills. I've seen a lot of developers so far who know a little about databases but are always asking for tips about how to formulate a quite simple query. Queries are not always that easy and simple. You do have to use multiple joins (inner, left, etc.) when querying a well normalized database.
I think a lot of the technical details have been covered here and I don't want to add to them. The one thing I want to say is more social than technical, don't fall for the "DBA knowing the best" trap as an application developer.
If you are having performance issues with query take ownership of the problem too. Do your own research and push for the DBAs to explain what's happening and how their solutions are addressing the problem.
Come up with your own suggestions too after you have done the research. That is, I try to find a cooperative solution to the problem rather than leaving database issues to the DBAs.
Simple respect.
It's not just a repository
You probably don't know better than the vendor or the DBAs
You won't support it at 3 a.m. with senior managers shouting at you
Consider Denormalization as a possible angel, not the devil, and also consider NoSQL databases as an alternative to relational databases.
Also, I think the Entity-Relation model is a must-know for every developper even if you don't design databases. It'll let you understand thoroughly what's your database all about.
Never insert data with the wrong text encoding.
Once your database becomes polluted with multiple encodings, the best you can do is apply some kind combination of heuristics and manual labor.
Aside from syntax and conceptual options they employ (such as joins, triggers, and stored procedures), one thing that will be critical for every developer employing a database is this:
Know how your engine is going to perform the query you are writing with specificity.
The reason I think this is so important is simply production stability. You should know how your code performs so you're not stopping all execution in your thread while you wait for a long function to complete, so why would you not want to know how your query will affect the database, your program, and perhaps even the server?
This is actually something that has hit my R&D team more times than missing semicolons or the like. The presumtion is the query will execute quickly because it does on their development system with only a few thousand rows in the tables. Even if the production database is the same size, it is more than likely going to be used a lot more, and thus suffer from other constraints like multiple users accessing it at the same time, or something going wrong with another query elsewhere, thus delaying the result of this query.
Even simple things like how joins affect performance of a query are invaluable in production. There are many features of many database engines that make things easier conceptually, but may introduce gotchas in performance if not thought of clearly.
Know your database engine execution process and plan for it.
For a middle-of-the-road professional developer who uses databases a lot (writing/maintaining queries daily or almost daily), I think the expectation should be the same as any other field: You wrote one in college.
Every C++ geek wrote a string class in college. Every graphics geek wrote a raytracer in college. Every web geek wrote interactive websites (usually before we had "web frameworks") in college. Every hardware nerd (and even software nerds) built a CPU in college. Every physician dissected an entire cadaver in college, even if she's only going to take my blood pressure and tell me my cholesterol is too high today. Why would databases be any different?
Unfortunately, they do seem different, today, for some reason. People want .NET programmers to know how strings work in C, but the internals of your RDBMS shouldn't concern you too much.
It's virtually impossible to get the same level of understanding from just reading about them, or even working your way down from the top. But if you start at the bottom and understand each piece, then it's relatively easy to figure out the specifics for your database. Even things that lots of database geeks can't seem to grok, like when to use a non-relational database.
Maybe that's a bit strict, especially if you didn't study computer science in college. I'll tone it down some: You could write one today, completely, from scratch. I don't care if you know the specifics of how the PostgreSQL query optimizer works, but if you know enough to write one yourself, it probably won't be too different from what they did. And you know, it's really not that hard to write a basic one.
The order of columns in a non-unique index is important.
The first column should be the column that has the most variability in its content (i.e. cardinality).
This is to aid SQL Server ability to create useful statistics in how to use the index at runtime.
Understand the tools that you use to program the database!!!
I wasted so much time trying to understand why my code was mysteriously failing.
If you're using .NET, for example, you need to know how to properly use the objects in the System.Data.SqlClient namespace. You need to know how to manage your SqlConnection objects to make sure they are opened, closed, and when necessary, disposed properly.
You need to know that when you use a SqlDataReader, it is necessary to close it separately from your SqlConnection. You need to understand how to keep connections open when appropriate to how to minimize the number of hits to the database (because they are relatively expensive in terms of computing time).
Basic SQL skills.
Indexing.
Deal with different incarnations of DATE/ TIME/ TIMESTAMP.
JDBC driver documentation for the platform you are using.
Deal with binary data types (CLOB, BLOB, etc.)
For some projects, and Object-Oriented model is better.
For other projects, a Relational model is better.
The impedance mismatch problem, and know the common deficiencies or ORMs.
RDBMS Compatibility
Look if it is needed to run the application in more than one RDBMS. If yes, it might be necessary to:
avoid RDBMS SQL extensions
eliminate triggers and store procedures
follow strict SQL standards
convert field data types
change transaction isolation levels
Otherwise, these questions should be treated separately and different versions (or configurations) of the application would be developed.
Don't depend on the order of rows returned by an SQL query.
Three (things) is the magic number:
Your database needs version control too.
Cursors are slow and you probably don't need them.
Triggers are evil*
*almost always

which db should i select if performance of postgres is low

In a web app that support more than 5000 users, postgres is becoming the bottle neck.
It takes more than 1 minute to add a new user.(even after optimizations and on Win 2k3)
So, as a design issue, which other DB's might be better?
Most likely, it's not PostgreSQL, it's your design. Changing shoes most likely will not make you a better dancer.
Do you know what is causing slowness? Is it contention, time to update indexes, seek times?
Are all 5000 users trying to write to the user table at the same exact time as you are trying to insert 5001st user? That, I can believe can cause a problem. You might have to go with something tuned to handling extreme concurrency, like Oracle.
MySQL (I am told) can be optimized to do faster reads than PostgreSQL, but both are pretty ridiculously fast in terms of # transactions/sec they support, and it doesn't sound like that's your problem.
P.S.
We were having a little discussion in the comments to a different answer -- do note that some of the biggest, storage-wise, databases in the world are implemented using Postgres (though they tend to tweak the internals of the engine). Postgres scales for data size extremely well, for concurrency better than most, and is very flexible in terms of what you can do with it.
I wish there was a better answer for you, 30 years after the technology was invented, we should be able to make users have less detailed knowledge of the system in order to have it run smoothly. But alas, extensive thinking and tweaking is required for all products I am aware of. I wonder if the creators of StackOverflow could share how they handled db concurrency and scalability? They are using SQLServer, I know that much.
P.P.S.
So as chance would have it I slammed head-first into a concurrency problem in Oracle yesterday. I am not totally sure I have it right, not being a DBA, but what the guys explained was something like this: We had a large number of processes connecting to the DB and examining the system dictionary, which apparently forces a short lock on it, despite the fact that it's just a read. Parsing queries does the same thing.. so we had (on a multi-tera system with 1000s of objects) a lot of forced wait times because processes were locking each other out of the system. Our system dictionary was also excessively big because it contains a separate copy of all the information for each partition, of which there can be thousands per table. This is not really related to PostgreSQL, but the takeaway is -- in addition to checking your design, make sure your queries are using bind variables and getting reused, and pressure is minimal on shared resources.
Please change the OS under which you run Postgres - the Windows port, though immensely useful for expanding the user base, is still not on a par with the (much older and more mature) Un*x ports (and especially the Linux one).
Ithink your best choice is still PostgresSQL. Spend the time to make sure you have properly tuned your application. After your confident you have reached the limits of what can be done with tuning, start cacheing everything you can. After that, start think about moving to an asynchronous master slave setup...Also are you running OLAP type functionality on the same database your doing OLTP on?
Let me introduce you to the simplest, most practical way to scale almost any database server if the database design is truly optimal: just double your ram for an instantaneous boost in performance. It's like magic.
PostgreSQL scales better than most, if you are going to stay with a relational db, Oracle would be it. ODBMS scale better but they have their own issues, as in that it is closer to programming to set one up.
Yahoo uses PostgreSQL, that should tell you something about is scalability.
As highlighted above the problem is not with the particular database you are using, i.e. PostgreSQL but one of the following:
Schema design, maybe you need to add, remove, refine your indexes
Hardware maybe you are asking to much of your server - you said 5k users but then again very few of them are probably querying the db at the same time
Queries: perhaps poorly defined resulting in lots of inefficiency
A pragmatic way to find out what is happening is to analyse the PostgeSQL log files and find out what queries in terms of:
Most frequently executed
Longest running
etc. etc.
A quick review will tell you where to focus your efforts and you will most likely resolve your issues fairly quickly. There is no silver bullet, you have to do some homework but this will be small compared to changing your db vendor.
Good news ... there are lots of utilities to analayse your log files that are easy to use and produce easy to interpret results, here are two:
pgFouine - a PostgreSQL log analyzer (PHP)
pgFouine: Sample report
PQA (ruby)
PQA: Sample report
First, I would make sure the optimizations are, indeed, useful. For example, if you have many indexes, sometimes adding or modifying a record can take a long time.
I know there are several big projects running over PostgreSQL, so take a look at this issue.
I'd suggest looking here for information on PostgreSQL's performance: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win
What version of PG are you running? As the releases have progressed, performance has improved greatly.
Hi had the same issue previously with my current company. When I first joined them, their queries were huge and very slow. It takes 10 minutes to run them. I was able to optimize them to a few milliseconds or 1 to 2 seconds. I have learned many things during that time and I will share a few highlights in them.
Check your query first. doing an inner join of all the tables that you need will always take sometime. One thing that I would suggest is always start off with the table with which you can actually cut your data to those that you need.
e.g. SELECT * FROM (SELECT * FROM person WHERE person ilike '%abc') AS person;
If you look at the example above, this will cut your results to something that you know you need and you can refine them more by doing an inner join. This is one of the best way to speed up your query but there are more than one way to skin a cat. I cannot explain all of them here because there are just too many but from the example above, you just need to modify that to suite your need.
It depends on your postgres version. Older postgres does internally optimize the query. on example is that on postgres 8.2 and below, IN statements are slower than 8.4's.
EXPLAIN ANALYZE is your friend. if your query is running slow, do an explain analyze to determine which of it is causing the slowness.
Vacuum your database. This will ensure that statistics on your database will almost match the actual result. Big difference in the statistics and actual will result on your query running slow.
If all of these does not help you, try modifying your postgresql.conf. Increase the shared memory and try to experiment with the configuration to better suite your needs.
Hope this helps, but of course, these are just for postgres optimization.
btw. 5000 users are not much. My DB contains users with about 200k to a million users.
If you do want to switch away from PostgreSQL, Sybase SQL Anywhere is number 5 in terms of price/performance on the TPC-C benchmark list. It's also the lowest price option (by far) on the top 10 list, and is the only non-Microsoft and non-Oracle entry.
It can easily scale to thousands of users and terabytes of data.
Full disclosure: I work on the SQL Anywhere development team.
We need more details: What version you are using? What is the memory usage of the server? Are you vacuuming the database? Your performance problems might be un-related to PostgreSQL.
If you have many reads over writes, you may want to try MySQL assuming that the problem is with Postgres, but your problem is a write problem.
Still, you may want to look into your database design, and possibly consider sharding. For a really large database you may still have to look at the above 2 issues regardless.
You may also want to look at non-RDBMS database servers or document oriented like Mensia and CouchDB depending on the task at hand. No single tool will manage all tasks, so choose wisely.
Just out of curiosity, do you have any stored procedures that may be causing this delay?

Resources