Testing custom ORM solution performance overhead - how to?

I have created a prototype of a custom ORM tool using aspect-oriented programming (PostSharp) that achieves persistence ignorance before compile time. Now I am trying to find out how much overhead it introduces compared to using a plain DataReader and ADO.NET. I made a test case - insert, read, delete data (about 1000 records) in MS SQL Server 2008 and MySQL Community Edition - and ran this test multiple times using plain ADO.NET and my custom tool.
I expected that the results would depend on many factors - memory, swapping, CPU, other processes - so I ran the tests many times (20-40). But the results were really unexpected: they just differed too much between runs. If there were only a few extreme values I could ignore them (maybe swapping occurred or something like that), but they were so different that I am sure I cannot trust this kind of testing. Almost half of the time my ORM showed 10% better performance than plain ADO.NET; the other times it was 10% worse.
Is there any way I can make these tests reliable? I do not have a powerful computer with lots of memory, but maybe I can somehow make MS SQL Server, MySQL, or ADO.NET behave as consistently as possible during the tests? And what about the record count - which is more reliable: using a small number of records and running the test more times, or the other way around?

Have you seen ORMBattle.NET? See the FAQ there; it has some ideas related to measuring the performance overhead introduced by a particular ORM tool. The test suite is open source.
Concerning your results:
Some ORM tools automatically batch statement sequences (i.e. send several SQL statements together). If this feature is implemented well in the ORM, it's easy to beat plain ADO.NET by 2-4 times on CRUD operations if the ADO.NET test does not involve batching. The tests on ORMBattle.NET cover both cases.
A lot depends on how you establish transaction boundaries there. Please refer to ORMBattle.NET FAQ for details.
CRUD tests aren't the best performance indicator at all. In general it's pretty easy to reach the peak possible performance here, since the RDBMS must do much more work than the ORM in this case.
P.S. I'm one of the ORMBattle.NET authors, so if you're interested in details / possible contributions, you can contact me directly (or join the ORMBattle.NET Google Group).

I would run the test for a longer duration and with many more iterations, as small differences would average out over time and you should get a clearer picture. Also, make sure you eliminate any external factors that may be affecting your test, such as other processes running, not enough free memory, cold start vs. warm start, network usage, etc.
Also, make sure that your database file and log file have enough free space allocated so you aren't waiting for the DB to grow the file during certain tests.
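For example, a hedged T-SQL sketch of pre-sizing the files before a benchmark run (the database and logical file names are hypothetical; check yours in sys.database_files):
-- Pre-grow the data and log files so no autogrow events fire mid-test.
ALTER DATABASE OrmBenchDb
    MODIFY FILE (NAME = N'OrmBenchDb', SIZE = 2048MB, FILEGROWTH = 256MB);
ALTER DATABASE OrmBenchDb
    MODIFY FILE (NAME = N'OrmBenchDb_log', SIZE = 512MB, FILEGROWTH = 128MB);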

First of all you need to find out where the variance comes from: the ORM layer itself, or the database?
Many times the source of such variance is the database itself. Databases are very complex systems, with many active processes inside that can interfere with the results of performance measurements. To achieve reproducible results you'll have to place your database under 'laboratory' conditions and make sure nothing unexpected happens. What that means varies from vendor to vendor, and you need to know some fairly advanced topics to tackle something like this. For instance, on a SQL Server database the typical sources of variation are:
cold cache vs. warm cache (both data and procedures)
log and database growth events
maintenance jobs
ghost cleanup
lazy writer
checkpoints
external memory pressure
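One way to tame several of these on a test box is to reset the caches to a known state before every run, so each iteration starts equally cold (or, alternatively, do a warm-up pass and measure only warm runs). A hedged T-SQL sketch, for a dedicated test server only:
-- Run before each test iteration; never on a shared or production server.
CHECKPOINT;              -- flush dirty pages so clean buffers can actually be dropped
DBCC DROPCLEANBUFFERS;   -- empty the data cache so every run starts cold
DBCC FREEPROCCACHE;      -- clear cached plans so compilation cost is identical each run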

Related

Best Database for remote sensor data logging

I need to choose a Database for storing data remotely from a big number (thousands to tens of thousands) of sensors that would generate around one entry per minute each.
This data needs to be queried in a variety of ways, from counting entries with certain characteristics for statistics to simple output for plotting.
I am looking around for the right tool. I started with MySQL, but I feel like it lacks the scalability needed for this project, and this led me to NoSQL databases, which I don't know much about.
Which Database, either relational or not would be a good choice?
Thanks.
There is usually no "best" database since they all involve trade-offs of one kind or another. Your question is also very vague because you don't say anything about your performance needs other than the number of inserts per minute (how much data per insert?) and that you need "scalability".
It also looks like a case of premature optimization because you say you "feel like [MySQL] lacks the scalability needed for this project", but it doesn't sound like you've run any tests to confirm whether this is a real problem. It's always better to get real data rather than base an important architectural decision on "feelings".
Here's a suggestion:
Write a simple test program that inserts 10,000 rows of sample data per minute
Run the program for a decent length of time (a few days or more) to generate a sizable chunk of test data
Run your queries to see if they meet your performance needs (which you haven't specified -- how fast do they need to be? how often will they run? how complex are they?)
You're testing at least two things here: whether your database can handle 10,000 inserts per minute and whether your queries will run quickly enough once you have a huge amount of data. With large datasets these will become competing priorities since you need indexes for fast queries, but indexes will start to slow down your inserts over time. At some point you'll need to think about data archival as well (or purging, if historical data isn't needed) both for performance and for practical reasons (finite storage space).
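As a rough illustration of the first step, here is a hedged SQL sketch (PostgreSQL syntax using generate_series; the sensor_readings table and its columns are invented for the example) that produces one minute's worth of sample rows in a single statement:
-- Insert 10,000 fake readings, one per sensor.
INSERT INTO sensor_readings (sensor_id, reading_time, value)
SELECT s, now(), random() * 100
FROM   generate_series(1, 10000) AS s;
Schedule that once a minute (cron, or a loop in a small script) and let it run for a few days to build up a realistic volume before trying your queries.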
These will be concerns no matter what database you select. From what little you've told us about your retrieval needs ("counting data with certain characteristics" and "simple outputting for plotting") it sounds like any type of database will do. It may be that other concerns are more important, such as ease of development (what languages and tools are you using?), deployment, management, code maintainability, etc.
Since this is sensor data we're talking about, you may also want to look at a round robin database (RRD) such as RRDTool to see if that approach better serves your needs.
Found this question while googling for "database for sensor data"
One of the most helpful search results (along with this SO question) was this blog:
Actually, I had started a similar project (http://reatha.de) but realized too late that I was not using the best technologies available. My approach was similar: MySQL + PHP. Eventually I realized that this was not scalable and stopped the project.
Additionally, a good starting point is looking at the list of databases offered by Heroku:
If they use one, it's probably not the worst choice.
I hope this helps.
You can try the Redis NoSQL database.

How to modernize an enormous legacy database?

I have a question, just looking for suggestions here.
So, my application is 'modernizing' a desktop application by converting it to the web, with an ICEFaces UI and server side written in Java. However, they are keeping around the same Oracle database, which at current count has about 700-900 tables and probably a billion total records in the tables. Some individual tables have 250 million rows, many have over 25 million.
Needless to say, the database is not scaling well. As a result, the performance of the application is looking to be abysmal. The architects / decision makers-that-be have all either refused or are unwilling to restructure the persistence. So, basically we are putting a fresh coat of paint on a functional desktop application that currently serves most user needs and does so with relative ease. The actual database performance is pretty slow in the desktop app now. The quick performance I referred to earlier was non-database related stuff (sorry I misspoke there). I am having trouble sleeping at night thinking of how poorly this application is going to perform and how difficult it is going to be for everyday users to do their job.
So, my question is, what options do I have to mitigate this impending disaster? Is there some type of intermediate layer I can put in between the database and the Java code to speed up performance while at the same time keeping the database structure intact? Caching is obviously an option, but I don't see that as being a cure-all. Is it possible to layer a NoSQL DB in between or something?
I don't understand how to reconcile two things you said.
Needless to say, the database is not scaling well
and
currently serves most user needs and does so with relative ease and quick performance.
You don't say you are adding new users or new function, just making the same function accessible via a web interface.
So why is there a problem? Your web app will be doing more or less the same database work as before.
In fact, introducing a web tier could well give new caching opportunities, reducing the work the DB is doing.
If your early pieces of web app development are showing poor performance then I would start by trying to understand how the queries you are doing in the web app differ from those done by the existing app. Is it possible that you are using some tooling which is taking a somewhat naive approach to generating queries?
If the current app performs well and your new java app doesn't, the problem is not in the database layer, but in your application layer. If performance is as bad as you say, they should notice fairly early and have the option of going back to the Desktop application.
The DBA should be able to readily identify the additional workload on the database from your application. Assuming the logic hasn't changed it is unlikely to be doing more writes. It could be reads or it could be 'chattier' (moving the same amount of information but in smaller parcels). Chatty applications can use a lot of CPU. A lot of architects try to move processing from the database layer into the application layer because "work on the database is expensive" but actually make things worse due to the overhead of the "to-and-fro".
PS.
There's nothing 'bad' about having 250 million rows in a table. Generally you access a table through an index. There are typically 2 or 3 hops from the top of an index to the bottom (and then one more to the table). I've got a 20 million row table with a BLEVEL of 2 and a 120+ million row table with a BLEVEL of 3.
Indexing means that you rarely hit more than a small proportion of your data blocks. The frequently used index blocks (and data blocks) get cached in the database server's memory. The DBA would be able to see if this memory area is too small for the workload (ie a lot of physical disk IO).
If your app is getting a lot of information that it doesn't really need, this can put pressure on the memory space. Don't be greedy: if you only need three columns from a row, don't grab the whole row.
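If you want to check index depth yourself, a hedged Oracle sketch (BLEVEL and LEAF_BLOCKS are columns of the USER_INDEXES dictionary view; the table name is hypothetical):
-- Index depth and size for one table's indexes.
SELECT index_name, blevel, leaf_blocks, distinct_keys
FROM   user_indexes
WHERE  table_name = 'ORDERS';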
What you describe is something that Oracle should be capable of handling very easily if you have the right equipment and database design. It should scale well if you get someone on your team who is a specialist in performance tuning large applications.
Redoing the database from scratch would cost a fortune and would introduce new bugs and the potential for loss of critical information is huge. It almost never is a better idea to rewrite the database at this point. Usually those kinds of projects fail miserably after costing the company thousands or even millions of dollars. Your architects made the right choice. Learn to accept that what you want isn't always the best way. The data is far more important to the company than the app. There are many reasons why people have learned not to try to redesign the database from scratch.
Now there are ways to improve database performance. The first thing I would consider with a database this size is partitioning the data. I would also consider archiving old data to a data warehouse and doing most reporting from that. Other things to consider would be upgrading your servers to higher-performing models, profiling to find the slowest-running queries and fixing them individually, looking at indexing, and updating statistics and indexes (not sure if that is what you do on Oracle; I'm a SQL Server gal, but your DBAs would know). There are some good books on refactoring old legacy databases. The one below is not database specific.
http://www.amazon.com/Refactoring-Databases-Evolutionary-Database-Design/dp/0321293533/ref=sr_1_1?ie=UTF8&s=books&qid=1275577997&sr=8-1
There are also some good books on performance tuning (look for ones specific to Oracle; what works for SQL Server or MySQL is not what is best for Oracle).
Personally I would get those and read them from cover to cover before designing a plan for how you are going to fix the poor performance. I would also include the DBAs in all your planning, they know things that you do not about the database and why some things are designed the way they are.
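To make the partitioning suggestion concrete, here is a minimal sketch in SQL Server syntax (the table and boundary dates are invented; on Oracle the equivalent is PARTITION BY RANGE in the table DDL):
-- Range-partition a large table by year so old data can be switched out or archived cheaply.
CREATE PARTITION FUNCTION pf_OrderYear (datetime)
    AS RANGE RIGHT FOR VALUES ('2008-01-01', '2009-01-01', '2010-01-01');
CREATE PARTITION SCHEME ps_OrderYear
    AS PARTITION pf_OrderYear ALL TO ([PRIMARY]);
CREATE TABLE dbo.Orders
(
    OrderID   bigint   NOT NULL,
    OrderDate datetime NOT NULL,
    Amount    money    NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderDate, OrderID)  -- partitioning column must be part of the key
) ON ps_OrderYear (OrderDate);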
If you have a lot of lookups for items that are not in the database, you can reduce their number by using a Bloom filter. Add everything in the database to the Bloom filter; then, before you do a lookup, check the Bloom filter first. Only if it reports the item as present do you need to bother the database. The Bloom filter will produce false positives, but you can design it for the size-versus-false-positive trade-off that best suits you.
The strategy is used by Google in their BigTable database, and they have reported that it significantly improves performance.
http://en.wikipedia.org/wiki/Bloom_filter
Good luck, working on tasks you don't believe in is tough.
So you put a fresh coat of paint on a functional and quick desktop application and then the system becomes slow?
And then you say that "it is needless to say that the database isn't scaling well"?
I don't get it. I think that there is something wrong with your fresh coat of paint, not with the database.
Don't be put down by this sort of thing. See it as a challenge, rather than something to be losing sleep over! I know it's tempting as a programmer to want to rip everything out and start over again, but from a business perspective, it's just not always viable. For example, by using the same database, the business can continue to use the old application while the new one is being developed and switch over customers in groups, rather than having to switch everyone over at the same time.
As for what you can do about performance, it depends a lot on the usage pattern. Caching can help greatly with mostly read-only databases. Even with read/write database, it can still be a boon if correctly designed. A NoSQL database might help with write-heavy stuff, but it might also be more trouble than it's worth if the data has to end up in a regular database anyway.
In the end, it all depends greatly on your application's architecture and usage patterns.
Good luck!
Well, without knowing too much about what kinds of queries are mostly done (I would expect lookups to be more common), perhaps you should try caching first. And cache at different layers: at the layer before the app server if possible, and of course, as you suggested, at the layer between the app server and the database.
Caching works well for read data and it might not be as bad as you think.
Have you looked at Terracotta? They have some caching and scaling features that might be relevant to you.
Take it as a challenge!
The way to 'mitigate this impending disaster' is to do what you should be doing anyway. If you follow best practices the pain of switching out your persistence layer at a later stage will be minimal.
Up until the time that you have valid performance benchmarks and identified bottlenecks in the system talk of performance is premature. In any case I would be surprised if many of the 'intermediate layer' strategies aren't already implemented at the database level.
If the database is legacy and enormous, then
1) it cannot be changed in a way that will change the interface, as this will break too many existing applications. Or, if you change the interface, this has to be coordinated with modifying multiple applications with associated testing.
2) If the issue is performance, then there are probably many changes that can be made to optimize the database without changing the interface.
3) Views can be used to maintain the existing interfaces while restructuring tables for more efficiency, or possibly to allow more efficient access in the future (see the sketch below).
4) Standard database optimizations, such as performance analysis, indexing, caching can probably greatly increase efficiency and performance without changing the interface.
There's a lot more that can be done, but you get the idea. It can't really be updated in one single big change. Changes have to be incremental, or transparent to the applications that use it.
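As a hedged illustration of point 3 (all names invented): a wide legacy table is split into narrower ones, and a view keeps the old name and shape so existing applications are untouched.
-- The old CUSTOMER table becomes CUSTOMER_CORE plus CUSTOMER_ADDRESS;
-- a view preserves the legacy interface.
CREATE VIEW customer AS
SELECT c.customer_id,
       c.customer_name,
       a.street,
       a.city,
       a.postal_code
FROM   customer_core    c
JOIN   customer_address a ON a.customer_id = c.customer_id;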
The database is PART of the application. Don't consider them to be separate; they aren't.
As a developer, you need to be free to make schema changes as necessary and to suggest data changes that improve performance or functionality in production (for example, archiving old data).
Your development system presumably does not have that much data, but has the exact same schema.
In order to do performance testing, you will need a system with the same hardware and same size data (same data if possible) as production. You should explain to management that performance testing is absolutely necessary as you feel the app isn't going to perform.
Of course making schema changes (adding / removing indexes, splitting tables out etc) may affect other parts of the system - which you should consider as parts of a SYSTEM - and hence do the necessary regression testing and fixing.
If you need to modify the database schema, and make changes to the desktop client accordingly, to make the web app perform, that is what you have to do - justify your design decision to the management.

Performance Testing a Greenfield Database

Assuming that best practices have been followed when designing a new database, how does one go about testing the database in a way that can improve confidence in the database's ability to meet adequate performance standards, and that will suggest performance-enhancing tweaks to the database structure if they are needed?
Do I need test data? What does that look like if no usage patterns have been established for the database yet?
NOTE: Resources such as blog posts and book titles are welcome.
I would do a few things:
1) Simulate user/application connections to the DB and test the load (load testing).
I would suggest connecting with many more users than are expected to actually use the system. You can have all your users log in, or use third-party software that will log in many, many users and perform defined functions that you feel adequately test your system.
2) Insert many (possibly millions of) test records and load test again (scalability testing). As tables grow you may find you need indexes where you didn't have them before. Or there could be problems with views or joins used throughout the system.
3) Analyze the database. I am referring to the method of analyzing tables. Here is a boring page describing what it is. Also here is a link to a great article on Oracle database tuning, some of which might relate to what you are doing (see the sketch after this list).
4) Run queries generated by applications/users and run explain plans for them. This will, for example, tell you when you have full table scans. It will help you fix a lot of your issues.
5) Also back up and restore from those backups to build confidence in that process as well.
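A rough illustration of steps 3 and 4 in Oracle syntax (the schema, table, and query are hypothetical):
-- Step 3: refresh optimizer statistics for a table and its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'ORDERS', cascade => TRUE);
END;
/
-- Step 4: capture and display the execution plan for a representative query.
EXPLAIN PLAN FOR
SELECT o.order_id, o.total
FROM   app.orders o
WHERE  o.customer_id = 42;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);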
You could use a tool such as RedGate's Data Generator to get a good load of test data in it to see how the schema performs under load. You're right that without knowing the usage patterns it's difficult to put together a perfect test plan but I presume you must have a rough idea as to the kind of queries that will be run against it.
Adequate performance standards are really defined by the specific client applications that will consume your database. Get a SQL Profiler trace going whilst the applications hit your DB and you should be able to quickly spot any problem areas which may need more optimising (or even de-normalising in some cases).
+1 birdlips, agree with the suggestions. However, database load testing can be very tricky precisely because the first and the crucial step is about predicting as best as possible the data patterns that will be encountered in the real world. This task is best done in conjunction with at least one domain expert, as it's very much to do with functional, not technical aspects of the system.
Modeling data patterns is critical, as most SQL execution plans are based on table "statistics", i.e. counts and ratios, which modern RDBMSs use to calculate the optimal query execution plan. Some people have written books on the so-called query optimizers, e.g. Cost-Based Oracle Fundamentals, and it's quite often a challenge troubleshooting these issues due to a lack of documentation of how the internals work (often intentionally, as RDBMS vendors don't want to reveal too much about the details).
Back to your question, I suggest the following steps:
Give yourself a couple of days/weeks/months (depending on the size and complexity of the project) to try to define the state of a 'mature' (e.g. 2-3 year old) database, as well as some performance test cases that you would need to execute on this large dataset.
Build all the scripts to pump in the baseline data. You can use 3rd-party tools, but I have often found them lacking in functionality for more advanced data distributions, and it's often much faster to write SQL than to learn new tools.
Build/implement the performance test scenario client! This now heavily depends on what kind of an application the DB needs to support. If you have a browser-based UI there are many tools such as LoadRunner, JMeter to do end-to-end testing. For web services there's SoapSonar, SoapUI... Maybe you'll have to write a custom JDBC/ODBC/.Net client with multi-threading capabilities...
Test -> tune -> test -> tune...
When you place the system in production get ready for surprises as your prediction of data patterns will never be very accurate. This means that you (or whoever is the production DBA) may need to think on his/her feet and create some indexes on the fly or apply other tricks of the trade.
Good luck
I'm in the same situation now, here's my approach (using SQL Server 2008):
Create a separate "Numbers" table with millions of rows of sample data. The table may have random strings, GUIDs, numerical values, etc.
Write a procedure to insert the sample data into your schema. Use modulus (%) of a number column to simulate different UserIDs, etc.
Create another "NewData" table similar to the first table. This can be used to simulate new records being added.
Create a "TestLog" table where you can record rowcount, start time and end time for your test queries.
Write a stored procedure to simulate the workflow you expect your application to perform, using new or existing records as appropriate.
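A hedged T-SQL sketch of the first two steps (the target dbo.Orders table and its columns are invented for the example):
-- Step 1: a Numbers table filled with one million sequential values.
CREATE TABLE dbo.Numbers (n int NOT NULL PRIMARY KEY);
INSERT INTO dbo.Numbers (n)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a CROSS JOIN sys.all_objects b;
-- Step 2: pump sample rows into the schema under test,
-- using modulus to simulate a spread of UserIDs.
INSERT INTO dbo.Orders (UserID, OrderDate, Amount)
SELECT n % 5000 + 1,                     -- 5000 distinct simulated users
       DATEADD(minute, -n, GETDATE()),   -- spread the rows over time
       (n % 100) * 1.25
FROM dbo.Numbers;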
If performance seems fast, consider the probability of a cache miss! For example, if your production server has 32GB RAM, and your table is expected to be 128GB, a random row lookup is >75% likely to not be found in the buffer cache.
To simulate this, you can clear the cache before running your query:
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
(If Oracle: ALTER SYSTEM FLUSH BUFFER_CACHE; flushing SHARED_POOL clears cached plans as well)
You may notice a 100x slowdown in performance as indexes and data pages must now be loaded from disk.
Run SET STATISTICS IO ON; to gather query statistics. Look for cases where the number of logical reads is very high (> 1000) for a query. This is usually a sign of a full table scan.
Use the standard techniques to understand your query access patterns (scans vs. seeks) and tune performance.
The Include Actual Execution Plan option and SQL Server Profiler are also useful here.
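A hedged sketch of the measurement wrapper around a test query (the query itself is a placeholder to replace with your own workload):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Placeholder query; substitute the statement you are measuring.
SELECT COUNT(*)
FROM   dbo.Orders
WHERE  OrderDate >= '2010-01-01';
In the Messages output, watch for high logical reads (over 1000 for a simple query is a common warning sign of a scan) and compare CPU time against elapsed time.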

Are stored procedures required for large data sets?

I've just started my first development job at a reasonably sized company that has to manage a lot of data. An average database is 6 GB (from what I've seen so far). One of the jobs is reporting. Here's how it's done currently:
Data is replicated and transferred onto a data warehouse. From there, all the data required for a particular report is gathered (thousands of rows and lots of tables) and aggregated to a reports database in the warehouse. This is all done with stored procedures.
When a report is requested, a stored procedure is invoked which copies the data onto a reports database which PHP reads from to display the data.
I'm not a big fan of stored procs at all. But the people I've spoken to insist that stored procedures are the only option, as queries directly against the data via a programming language are incredibly slow (think 30 mins?). Security is also a concern.
So my question is - are stored procedures required when you have a very large data set? Do queries really take that long on such a large amount of data or is there a problem with either the DB servers or how the data is arranged (and indexed?). I've got a feeling that something is wrong.
The reasoning behind using a stored procedure is that the execution plan that is created in order to execute your procedure is cached by SQL Server in an area of memory known as the Plan Cache. When the procedure is then subsequently re-run at a later time, the execution plan has the possibility of being re-used.
A stored procedure will not run any faster than the same query executed as a batch of T-SQL. It is the re-use of execution plans that results in a performance improvement. The query cost will be the same for the actual T-SQL.
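If you want to see this for yourself, a hedged sketch that inspects the plan cache (these DMVs exist in SQL Server 2005 and later):
-- Which cached plans are being reused, and how often?
SELECT TOP (20)
       cp.objtype,      -- 'Proc' for stored procedures, 'Adhoc' for plain batches
       cp.usecounts,    -- number of times the cached plan has been reused
       st.text
FROM   sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
ORDER BY cp.usecounts DESC;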
Offloading data to a reporting database is a typical approach; however, you may need to review your indexing strategy on the reporting database, as it will likely need to be quite different from that of your OLTP platform, for example.
You may also wish to consider using SQL Server Analysis Services to serve your reporting requirements, as it sounds like your reports contain lots of data aggregations. Storing and processing data for the purpose of fast counts and analytics is exactly what SSAS is all about. It sounds like it is time for your business to look at building a data warehouse.
I hope this helps but please feel free to request further details.
Cheers, John
In the context in which you are operating - large corporate database accessed in several places - it is virtually always best to place as much business logic inside the database as is possible.
In this case your immediate performance benefits are:
Firstly, if the SP involves any processing beyond a simple select, processing the data within the database can be orders of magnitude faster than sending rows across the network to your program for handling there.
Secondly, you acquire some benefit from the SP being stored in compiled form, although this is usually marginal compared to the first point when processing large volumes.
However, and in my mind often more important than performance, is the fact that with corporate databases, encapsulating the logic inside the database itself provides major management and maintenance benefits:
Data structures can be abstracted away from program logic, allowing database structures to change without requiring changes to programs accessing the data. Anyone who has spent hours grep'ing a corporate codebase for SQL using [mytable] before making a simple database change will appreciate this.
SPs can provide a security layer, although this can be overused and overrelied on.
You say this is your first job for a company with a database of this type, so you can be forgiven for not appreciating how a database-centric approach to handling the data is really essential in such environments. You are not alone either - in a recent podcast Jeff Atwood said he wasn't a fan of putting code into databases. This is a fine and valid opinion where you are dealing with a database serving a single application, but is 100% wrong with a database used across a company by several applications, where the best policy is to screw down the data with a full complement of constraints and use SPs liberally for access and update.
The reason for this is that if you don't, such databases always lose data integrity and accumulate crud. Sometimes it's virtually impossible to imagine how they do, but in any large corporate database (tens of millions of records) without sufficient constraints there will be badly formed records - at best these force a periodic clean-up of data (a task I regularly used to get dumped with as a junior programmer); at worst they will cause applications to crash due to invalid inputs, or, even worse, not crash but deliver incorrect business information to the end users. And if your end user is your finance director, then that's your job on the line :-)
It seems to me that there is an additional step in there that, based on your description, appears unnecessary. Here is what I am referring to -
"When a report is requested, a stored procedure is invoked which gathers the data into a format required for a report, and forwarded to another stored procedure which transforms the data into a view, and forwards THAT off to a PHP framework for display."
A sproc transforms the data for a report, then another sproc transforms this data into another format for front-end presentation - is the data ever used in the format it is in after the first sproc? If not, that stage seems unnecessary to me.
I'm assuming that your reports database is a data warehouse and that the data is ETL'ed and stored within it in a format suited to reporting. Where I currently work, this is common practice.
As for your question regarding stored procedures, they allow you to centralize logic within the database and "encapsulate" security, the first of which would appear to be of benefit within your organisation, given the other sprocs that you have for data transformation. Stored procedures also have a stored execution plan which, under some circumstances, can provide some improvement to performance.
I found that stored procedures help with large data sets because they eliminate a ton of network traffic, which can be a huge performance bottleneck depending on how large the data set actually is.
When processing large numbers of rows, where indexes are available and the SQL is relatively tuned, the database engine performing set-based operations directly on the data - through SQL, say - will almost always outperform row-by-row processing (even on the same server) in a client tool. The data is not crossing any physical or logical boundaries to leave the database server processes or to leave the database server and go out across the network. Even performing RBAR (row by agonizing row) on the server will be faster than performing it in a client tool, if only a limited amount of data really needs to ever leave the server, because...
When you start to pull more data across networks, then the process will slow down and limiting the number of rows at each stage becomes the next optimization.
All of this really has nothing to do with stored procedures. Stored procedures (in SQL Server) no longer provide much performance advantages over batch SQL. Stored procedures do provide a large number of other benefits like modularization, encapsulation, security management, design by contract, version management. Performance, however is no longer an advantage.
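To make the set-based versus row-by-row point concrete, a hedged T-SQL sketch (table and column names are invented) contrasting one statement with a cursor doing the same work:
-- Set-based: one statement, processed entirely inside the engine.
UPDATE dbo.Orders
SET    Status = 'ARCHIVED'
WHERE  OrderDate < DATEADD(year, -2, GETDATE());
-- Row by agonizing row: a cursor doing the same work, typically far slower.
DECLARE @id bigint;
DECLARE c CURSOR FOR
    SELECT OrderID FROM dbo.Orders WHERE OrderDate < DATEADD(year, -2, GETDATE());
OPEN c;
FETCH NEXT FROM c INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders SET Status = 'ARCHIVED' WHERE OrderID = @id;
    FETCH NEXT FROM c INTO @id;
END
CLOSE c;
DEALLOCATE c;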
Generally speaking, stored procedures have a number of advantages over direct queries. I can't comment on your complete end-to-end process; however, SPs will probably perform faster. For a start, a direct query needs to be compiled and an execution plan worked out every time you run it - SPs don't.
There are other reasons why you would want to use stored procedures - centralisation of logic, security, etc.
The end to end process does look a little complicated but there may be good reasons for it simply due to the data volume - it might well be that if you run the reports on the main database, the queries are slowing down the rest of the system so much that you'll cause problems for the rest of the users.
Regarding the stored procedures, their main advantage in a scenario like this is that they are pre-compiled and the database has already worked out what it considers to be the optimal query plan. Especially with the data volumes you are talking about, this might well result in a very noticeable performance improvement.
And yes, depending on the complexity of the report, a query like this can take half an hour or longer...
This reporting solution seems to have been designed by people that think the database is the centre of the world. This is a common and valid view – however I don’t always hold to it.
When moving data between tables/databases, it can be a lot quicker to use stored procs, as the data does not need to travel between the database and the application. However, in most cases I would rather not use stored procs, as they make development more complex; I am in the ORM camp myself. You can sometimes get great speedups by loading lots of data into RAM and processing it there, however that is a totally different way of coding and will not allow the reuse of the logic that is already in the stored procs. Sorry, I think you are stuck with stored procs while in that job.
Given the amount of data being moved about, if using SQL Server I would look at using SSIS or DTS - Oracle will have something along the same lines. SSIS will do the data transformations on many threads while taking care of a lot of the details for you.
Remember that the design of software has more to do with the history of the software and the people working on it than with the “right way of doing it”. Come back in 100 years and we may know how to write software; at present it is mostly a case of the blind leading the blind. Just like when the first bridges were built and a lot of them fell down, no one could tell you in advance which bridge would keep standing and why.
Unlike autogenerated code from an ORM product, stored procs can be performance tuned. This is critical in large production environment. There are many ways to tweak performance that are not available when using an ORM. Also there are many many tasks performed by a large database which have nothing to do with the user interface and thus should not be run from code produced from there.
Stored procs are also required if you want to control rights so that the users can only do the procedures specified in the proc and nothing else. Otherwise, users can much more easily make unauthorized changes to the databases and commit fraud. This is one reason why database people who work with large business critical systems, do not allow any access except through stored procs.
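A minimal T-SQL sketch of that permission model (role, procedure, and table names are hypothetical): users in the role can run the procedure but have no direct rights on the underlying table.
CREATE ROLE report_users;
GRANT EXECUTE ON dbo.usp_GetMonthlySales TO report_users;
DENY SELECT, INSERT, UPDATE, DELETE ON dbo.SalesDetail TO report_users;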
If you are moving large amounts of data to other servers though, I would consider using DTS (if using SQL Server 2000) or SSIS. This may speed up your processes still further, but it will depend greatly on what you are doing and how.
The fact that sps may be faster in this case doesn't preclude that indexing may be wrong or statistics out of date, but generally dbas who manage large sets of data tend to be pretty on top of this stuff.
It is true the process you describe seems a bit convoluted, but without seeing the structure of what is happening and understanding the database and environment, I can't say if maybe this is the best process.
I can tell you that new employees who come in and want to change working stuff to fit their own personal prejudices tend to be taken less than seriously, and then you will have little credibility when you do need to suggest a valid change. This is particularly true when your past experience is not with databases of the same size or type of processing. If you were an expert in large systems, you might be taken more seriously from the start, but, face it, you are not, and thus your opinion is not likely to sway anybody until you have been there a while and they have a measure of your real capabilities. Plus, if you learn the system as it is and work with it as it is, you will be in a better position in six months or so to suggest improvements rather than changes.
I could perhaps come up with more, but a few points.
Assuming a modern DB, stored procedures probably won't actually be noticeably faster than ordinary queries due to caching and the like.
The security benefits of Stored procedures are somewhat overrated.
Change is evil. Consistency is king.
I'd say #3 trumps all other concerns unless stored procedures are causing a legitimate problem.
The fastest way to do reporting is to just read all the data into memory (a 64-bit OS is required) and walk the objects. This is of course limited to RAM size (32 GB is affordable) and to reports where you hit a large part of the DB. No need to make the effort for small reports.
In the old days I could run a report querying over 8 million objects in 1.5 seconds. That was with about a gigabyte of RAM on a 3 GHz Pentium 4. 64-bit should be about twice as slow, but that is compensated for by faster processors.

Which DB should I select if the performance of Postgres is low?

In a web app that supports more than 5000 users, Postgres is becoming the bottleneck.
It takes more than 1 minute to add a new user (even after optimizations, and on Win 2k3).
So, as a design issue, which other DBs might be better?
Most likely, it's not PostgreSQL, it's your design. Changing shoes most likely will not make you a better dancer.
Do you know what is causing slowness? Is it contention, time to update indexes, seek times?
Are all 5000 users trying to write to the user table at the same exact time as you are trying to insert the 5001st user? That, I can believe, can cause a problem. You might have to go with something tuned to handling extreme concurrency, like Oracle.
MySQL (I am told) can be optimized to do faster reads than PostgreSQL, but both are pretty ridiculously fast in terms of # transactions/sec they support, and it doesn't sound like that's your problem.
P.S.
We were having a little discussion in the comments to a different answer -- do note that some of the biggest, storage-wise, databases in the world are implemented using Postgres (though they tend to tweak the internals of the engine). Postgres scales for data size extremely well, for concurrency better than most, and is very flexible in terms of what you can do with it.
I wish there was a better answer for you, 30 years after the technology was invented, we should be able to make users have less detailed knowledge of the system in order to have it run smoothly. But alas, extensive thinking and tweaking is required for all products I am aware of. I wonder if the creators of StackOverflow could share how they handled db concurrency and scalability? They are using SQLServer, I know that much.
P.P.S.
So as chance would have it, I slammed head-first into a concurrency problem in Oracle yesterday. I am not totally sure I have it right, not being a DBA, but what the guys explained was something like this: we had a large number of processes connecting to the DB and examining the system dictionary, which apparently forces a short lock on it, despite the fact that it's just a read. Parsing queries does the same thing, so we had (on a multi-terabyte system with 1000s of objects) a lot of forced wait times because processes were locking each other out of the system. Our system dictionary was also excessively big because it contains a separate copy of all the information for each partition, of which there can be thousands per table. This is not really related to PostgreSQL, but the takeaway is: in addition to checking your design, make sure your queries are using bind variables and getting reused, and that pressure on shared resources is minimal.
Please change the OS under which you run Postgres - the Windows port, though immensely useful for expanding the user base, is still not on a par with the (much older and more mature) Un*x ports (and especially the Linux one).
I think your best choice is still PostgreSQL. Spend the time to make sure you have properly tuned your application. After you're confident you have reached the limits of what can be done with tuning, start caching everything you can. After that, start thinking about moving to an asynchronous master-slave setup. Also, are you running OLAP-type functionality on the same database you're doing OLTP on?
Let me introduce you to the simplest, most practical way to scale almost any database server if the database design is truly optimal: just double your RAM for an instantaneous boost in performance. It's like magic.
PostgreSQL scales better than most. If you are going to stay with a relational DB, Oracle would be it. ODBMSs scale better, but they have their own issues, in that setting one up is closer to programming.
Yahoo uses PostgreSQL; that should tell you something about its scalability.
As highlighted above, the problem is not with the particular database you are using (i.e. PostgreSQL) but with one of the following:
Schema design - maybe you need to add, remove, or refine your indexes
Hardware - maybe you are asking too much of your server; you said 5k users, but then again very few of them are probably querying the DB at the same time
Queries - perhaps poorly written, resulting in lots of inefficiency
A pragmatic way to find out what is happening is to analyse the PostgreSQL log files and find out which queries are the:
Most frequently executed
Longest running
etc. etc.
A quick review will tell you where to focus your efforts and you will most likely resolve your issues fairly quickly. There is no silver bullet, you have to do some homework but this will be small compared to changing your db vendor.
Good news... there are lots of utilities to analyse your log files that are easy to use and produce easy-to-interpret results. Here are two:
pgFouine - a PostgreSQL log analyzer (PHP)
pgFouine: Sample report
PQA (ruby)
PQA: Sample report
First, I would make sure the optimizations are, indeed, useful. For example, if you have many indexes, sometimes adding or modifying a record can take a long time.
I know there are several big projects running over PostgreSQL, so take a look at this issue.
I'd suggest looking here for information on PostgreSQL's performance: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win
What version of PG are you running? As the releases have progressed, performance has improved greatly.
I had the same issue previously at my current company. When I first joined, their queries were huge and very slow - they took 10 minutes to run. I was able to optimize them down to a few milliseconds, or 1 to 2 seconds. I learned many things during that time and I will share a few highlights.
Check your query first. Doing an inner join of all the tables that you need will always take some time. One thing that I would suggest is to always start off with the table that lets you cut your data down to just what you need.
e.g. SELECT * FROM (SELECT * FROM person WHERE name ILIKE '%abc%') AS person;
If you look at the example above, this cuts your results down to something you know you need, and you can refine them further by doing an inner join. This is one of the best ways to speed up your query, but there is more than one way to skin a cat. I cannot explain all of them here because there are just too many, but from the example above you just need to adapt it to suit your needs.
It also depends on your Postgres version; older versions do not optimize queries internally as well. One example is that on Postgres 8.2 and below, IN statements are slower than in 8.4.
EXPLAIN ANALYZE is your friend. If your query is running slow, do an EXPLAIN ANALYZE to determine which part of it is causing the slowness.
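A hedged sketch (table and column names invented) of what to run and what to look for:
EXPLAIN ANALYZE
SELECT u.id, u.email
FROM   users u
WHERE  u.created_at > now() - interval '1 day';
-- Look for large gaps between estimated and actual row counts,
-- and for sequential scans on big tables where you expected an index.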
Vacuum (and analyze) your database. This will ensure that the statistics on your database almost match the actual data. A big difference between the statistics and the actual data will result in your query running slowly.
If none of these helps, try modifying your postgresql.conf. Increase the shared memory and experiment with the configuration to better suit your needs.
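For example, a hedged postgresql.conf sketch of the memory-related settings people usually start with (the values are placeholders to adjust for your hardware, not recommendations):
# postgresql.conf - starting points only
shared_buffers = 512MB          # main shared data cache
work_mem = 16MB                 # per-sort / per-hash memory, allocated per operation
effective_cache_size = 2GB      # planner hint about how much the OS is caching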
Hope this helps, but of course, these are just for postgres optimization.
BTW, 5000 users is not much. My DBs contain about 200k to a million users.
If you do want to switch away from PostgreSQL, Sybase SQL Anywhere is number 5 in terms of price/performance on the TPC-C benchmark list. It's also the lowest price option (by far) on the top 10 list, and is the only non-Microsoft and non-Oracle entry.
It can easily scale to thousands of users and terabytes of data.
Full disclosure: I work on the SQL Anywhere development team.
We need more details: What version are you using? What is the memory usage of the server? Are you vacuuming the database? Your performance problems might be unrelated to PostgreSQL.
If you have many more reads than writes, you may want to try MySQL, assuming the problem is with Postgres - but your problem sounds like a write problem.
Still, you may want to look into your database design, and possibly consider sharding. For a really large database you may still have to look at the above 2 issues regardless.
You may also want to look at non-RDBMS database servers, or document-oriented ones like Mensia and CouchDB, depending on the task at hand. No single tool will manage all tasks, so choose wisely.
Just out of curiosity, do you have any stored procedures that may be causing this delay?
