Related
I have created a prototype of a custom ORM tool using aspect oriented programming (PostSHarp) and achieving persistence ignorance (before compile-time). Now I tried to find out how much overhead does it introduce compared to using pure DataReader and ADO.NET. I made a test case - insert, read, delete data (about 1000 records) in MS SQL Server 2008 and MySQL Community Edition. I run this test multiple times using pure ADO.NET and my custom tool.
I expected that results will depend on many factors - memory, swapping, CPU, other processes so I ran tests for many times (20-40). But the results were really unexpected. They just differed too much between those cases. If there were just some extreme values, I could ignore them (maybe swapping ocurred or smth. like that) but they were so different that I am sure I cannot trust this kind of testing. Almost half of times my ORM showed 10% better performance than pure ADO.NET, other times it was -10%.
Is there any way I can make those tests reliable? I do not have a powerful computer with lots of memory, but maybe I somehow can make MS SQL and MySQL or ADO.NET to be as consistent as possible during those tests? And how about count of records - which is more reliable, using small amount of records and running more times or other way?
Have you seen ORMBattle.NET? See FAQ there, there are some ideas related to measuring performance overhead introduced by a particular ORM tool. Test suite is open source.
Concerning your results:
Some ORM tools automatically batch statement sequences (i.e. send several SQL statements together). If this feature is implemented well in ORM, it's easy to beat plain ADO.NET by 2-4 times on CRUD operations, if ADO.NET test does not involve batching. Tests on ORMBattle.NET test both cases.
A lot depends on how you establish transaction boundaries there. Please refer to ORMBattle.NET FAQ for details.
CRUD tests aren't best performance indicator at all. In general, it's pretty easy to get
peak possible performance here, since in general, RDBMS must do much more than ORM in this case.
P.S. I'm one of ORMBattle.NET authors, so if you're interested in details / possible contributions, you can contact me directly (or join ORMBattle.NET Google Groups).
I would run the test for a longer duration and with many more iterations as small differences would average out over time and you should get a clearer picture. Also, make sure you eliminate any external things that may be affecting your test, such as other processes running, non enough free memory, cold start vs warm start, network usage, etc.
Also, make sure that your database file and log file have enough free space allocated so you aren't waiting for the DB to grow the file during certain tests.
First of all you need to find out where does the variance come from. The ORM layer itself or the database?
Many times the source of such variance is the database itself. Databases are very complex systems, with many active processes inside that can interact with the result of performance measurements. To achieve some reproductible results you'll have to place your database under 'laboratory' conditions and make sure nothing unexpected happens. what that means depends from vendor to vendor and you need know some pretty advanced topics in order to tacle something like this. For instance, on a SQL Server database the typical sources of variation are:
cold cache vs. warm cache (both data and procedures)
log and database growth events
maintenance jobs
ghost cleanup
lazy writer
checkpoints
external memory pressure
We have a database with ~100K business objects in it. Each object has about 40 properties which are stored amongst 15 tables. I have to get these objects, perform some transforms on them and then write them to a different database (with the same schema.)
This is ADO.Net 3.5, SQL Server 2005.
We have a library method to write a single property. It figures out which of the 15 tables the property goes into, creates and opens a connection, determines whether the property already exists and does an insert or update accordingly, and closes the connection.
My first pass at the program was to read an object from the source DB, perform the transform, and call the library routine on each of its 40 properties to write the object to the destination DB. Repeat 100,000 times. Obviously this is egregiously inefficent.
What are some good designs for handling this type of problem?
Thanks
This is exactly the sort of thing that SQL Server Integration Services (SSIS) is good for. It's documented in Books Online, same as SQL Server is.
Unfortunately, I would say that you need to forget your client-side library, and do it all in SQL.
How many times do you need to do this? If only once, and it can run unattended, I see no reason why you shouldn't reuse your existing client code. Automating the work of human beings is what computers are for. If it's inefficient, I know that sucks, but if you're going to do a week of work setting up a SSIS package, that's inefficient too. Plus, your client-side solution could contain business logic or validation code that you'd have to remember to carry over to SQL.
You might want to research Create_Assembly, moving your client code across the network to reside on your SQL box. This will avoid network latency, but could destabilize your SQL Server.
Bad news: you have many options
use flatfile transformations: Extract all the data into flatfiles, manipulate them using grep, awk, sed, c, perl into the required insert/update statements and execute those against the target database
PRO: Fast; CON: extremly ugly ... nightmare for maintanance, don't do this if you need this for longer then a week. And a couple dozens of executions
use pure sql: I don't know much about sql server, but I assume it has away to access one database from within the other, so one of the fastes ways to do this is to write it as a collection of 'insert / update / merge statements fed with select statements.
PRO: Fast, one technology only; CON: Requires direct connection between databases You might reach the limit of SQL or the available SQL knowledge pretty fast, depending on the kind of transformation.
use t-sql, or whatever iterative language the database provides, everything else is similiar to pure sql aproach.
PRO: pretty fast since you don't leave the database CON: I don't know t-sql, but if it is anything like PL/SQL it is not the nicest language to do complex transformation.
use a high level language (Java, C#, VB ...): You would load your data into proper business objects manipulate those and store them in the database. Pretty much what you seem to be doing right now, although it sounds there are better ORMs available, e.g. nhibernate
use a ETL Tool: There are special tools for extracting, transforming and loading data. They often support various databases. And have many strategies readily available for deciding if an update or insert is in place.
PRO: Sorry, you'll have to ask somebody else for that, I so far have nothing but bad experience with those tools.
CON: A highly specialized tool, that you need to master. I my personal experience: slower in implementation and execution of the transformation then handwritten SQL. A nightmare for maintainability, since everything is hidden away in proprietary repositories, so for IDE, Version Control, CI, Testing you are stuck with whatever the tool provider gives you, if any.
PRO: Even complex manipulations can be implemented in a clean maintainable way, you can use all the fancy tools like good IDEs, Testing Frameworks, CI Systems to support you while developing the transformation.
CON: It adds a lot of overhead (retrieving the data, out of the database, instanciating the objects, and marshalling the objects back into the target database. I'd go this way if it is a process that is going to be around for a long time.
Building on the last option you could further glorify the architectur by using messaging and webservices, which could be relevant if you have more then one source database, or more then one target database. Or you could manually implement a multithreaded transformer, in order to gain through put. But I guess I am leaving the scope of your question.
I'm with John, SSIS is the way to go for any repeatable process to import large amounts of data. It should be much faster than the 30 hours you are currently getting. You could also write pure t-sql code to do this if the two database are on the same server or are linked servers. If you go the t-sql route, you may need to do a hybrid of set-based and looping code to run on batches (of say 2000 records at a time) rather than lock up the table for the whole time a large insert would take.
I'm trying to populate a table with user information in a MS SQL database with information from multiple data sources (i.e. LDAP and some other MS SQL databases). The process needs to run as a daily scheduled task to ensure that the user information table is updated frequently.
The initial attempt at this query/ update script was written in VBScript and would query each data source and then update the user information table. Unfortunately, this takes very long to run and update the user information table.
I'm curious if anyone has written anything similar and if you recommend or noticed a performance improvement by writing the script in another language. Some have recommended Perl because of multi-threading, but if anyone has any other suggestions on ways to improve the process or other approaches could you share tips or lessons learned.
It's good practise to use Data Transformation Services (DTS) or SSIS as it has become known for doing repetitive DB tasks. Although this won't solve your problem, it may give some pointers to what is going on as you can log each stage of the process, wrap it in transactions etc. It is especially well suited for bulk loading and updates, and it understands VBScript natively so there should be no problem there.
Other than that I have to agree with Brian, find out what's making it slow and fix that, changing languages is unlikely to fix it on its own, especially if you have an underlying issue. As a general point my experience when using LDAP, which is pretty small, was it could be incredibly slow reader bulk user details.
I can't tell you how to solve your particular problem, but whenever you run into this situation you want to find out why it is slow before you try to solve it. Where is the slow down? Some major things to consider and investigate include:
getting the data
interacting with the network
querying the database
updating indices in the database
Get some timing and profiling information to figure out where to concentrate your efforts.
Hmmm. Seems like you could cron a script that uses dump utils from the various sources, then seds the output into good form for the load util for the target database. The script could be in bash or Perl, whatever.
Edit: In terms of performance, I think the first thing you want to try is to make sure that you disable any autocommit at the beginning of the load process, then issue the commit after writing all the records. This can make a HUGE performance difference.
AS MrTelly said, use SSIS or DTS. Then schedule the package to run. Just converting to this alone will probaly fix your speed issue as they have tasks that are optimized for bulk inserting. I would never do this in a script language rather that t-SQl anyway. Likely your script works row by row instead of on sets of data but that is just a guess.
In a web app that support more than 5000 users, postgres is becoming the bottle neck.
It takes more than 1 minute to add a new user.(even after optimizations and on Win 2k3)
So, as a design issue, which other DB's might be better?
Most likely, it's not PostgreSQL, it's your design. Changing shoes most likely will not make you a better dancer.
Do you know what is causing slowness? Is it contention, time to update indexes, seek times?
Are all 5000 users trying to write to the user table at the same exact time as you are trying to insert 5001st user? That, I can believe can cause a problem. You might have to go with something tuned to handling extreme concurrency, like Oracle.
MySQL (I am told) can be optimized to do faster reads than PostgreSQL, but both are pretty ridiculously fast in terms of # transactions/sec they support, and it doesn't sound like that's your problem.
P.S.
We were having a little discussion in the comments to a different answer -- do note that some of the biggest, storage-wise, databases in the world are implemented using Postgres (though they tend to tweak the internals of the engine). Postgres scales for data size extremely well, for concurrency better than most, and is very flexible in terms of what you can do with it.
I wish there was a better answer for you, 30 years after the technology was invented, we should be able to make users have less detailed knowledge of the system in order to have it run smoothly. But alas, extensive thinking and tweaking is required for all products I am aware of. I wonder if the creators of StackOverflow could share how they handled db concurrency and scalability? They are using SQLServer, I know that much.
P.P.S.
So as chance would have it I slammed head-first into a concurrency problem in Oracle yesterday. I am not totally sure I have it right, not being a DBA, but what the guys explained was something like this: We had a large number of processes connecting to the DB and examining the system dictionary, which apparently forces a short lock on it, despite the fact that it's just a read. Parsing queries does the same thing.. so we had (on a multi-tera system with 1000s of objects) a lot of forced wait times because processes were locking each other out of the system. Our system dictionary was also excessively big because it contains a separate copy of all the information for each partition, of which there can be thousands per table. This is not really related to PostgreSQL, but the takeaway is -- in addition to checking your design, make sure your queries are using bind variables and getting reused, and pressure is minimal on shared resources.
Please change the OS under which you run Postgres - the Windows port, though immensely useful for expanding the user base, is still not on a par with the (much older and more mature) Un*x ports (and especially the Linux one).
Ithink your best choice is still PostgresSQL. Spend the time to make sure you have properly tuned your application. After your confident you have reached the limits of what can be done with tuning, start cacheing everything you can. After that, start think about moving to an asynchronous master slave setup...Also are you running OLAP type functionality on the same database your doing OLTP on?
Let me introduce you to the simplest, most practical way to scale almost any database server if the database design is truly optimal: just double your ram for an instantaneous boost in performance. It's like magic.
PostgreSQL scales better than most, if you are going to stay with a relational db, Oracle would be it. ODBMS scale better but they have their own issues, as in that it is closer to programming to set one up.
Yahoo uses PostgreSQL, that should tell you something about is scalability.
As highlighted above the problem is not with the particular database you are using, i.e. PostgreSQL but one of the following:
Schema design, maybe you need to add, remove, refine your indexes
Hardware maybe you are asking to much of your server - you said 5k users but then again very few of them are probably querying the db at the same time
Queries: perhaps poorly defined resulting in lots of inefficiency
A pragmatic way to find out what is happening is to analyse the PostgeSQL log files and find out what queries in terms of:
Most frequently executed
Longest running
etc. etc.
A quick review will tell you where to focus your efforts and you will most likely resolve your issues fairly quickly. There is no silver bullet, you have to do some homework but this will be small compared to changing your db vendor.
Good news ... there are lots of utilities to analayse your log files that are easy to use and produce easy to interpret results, here are two:
pgFouine - a PostgreSQL log analyzer (PHP)
pgFouine: Sample report
PQA (ruby)
PQA: Sample report
First, I would make sure the optimizations are, indeed, useful. For example, if you have many indexes, sometimes adding or modifying a record can take a long time.
I know there are several big projects running over PostgreSQL, so take a look at this issue.
I'd suggest looking here for information on PostgreSQL's performance: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win
What version of PG are you running? As the releases have progressed, performance has improved greatly.
Hi had the same issue previously with my current company. When I first joined them, their queries were huge and very slow. It takes 10 minutes to run them. I was able to optimize them to a few milliseconds or 1 to 2 seconds. I have learned many things during that time and I will share a few highlights in them.
Check your query first. doing an inner join of all the tables that you need will always take sometime. One thing that I would suggest is always start off with the table with which you can actually cut your data to those that you need.
e.g. SELECT * FROM (SELECT * FROM person WHERE person ilike '%abc') AS person;
If you look at the example above, this will cut your results to something that you know you need and you can refine them more by doing an inner join. This is one of the best way to speed up your query but there are more than one way to skin a cat. I cannot explain all of them here because there are just too many but from the example above, you just need to modify that to suite your need.
It depends on your postgres version. Older postgres does internally optimize the query. on example is that on postgres 8.2 and below, IN statements are slower than 8.4's.
EXPLAIN ANALYZE is your friend. if your query is running slow, do an explain analyze to determine which of it is causing the slowness.
Vacuum your database. This will ensure that statistics on your database will almost match the actual result. Big difference in the statistics and actual will result on your query running slow.
If all of these does not help you, try modifying your postgresql.conf. Increase the shared memory and try to experiment with the configuration to better suite your needs.
Hope this helps, but of course, these are just for postgres optimization.
btw. 5000 users are not much. My DB contains users with about 200k to a million users.
If you do want to switch away from PostgreSQL, Sybase SQL Anywhere is number 5 in terms of price/performance on the TPC-C benchmark list. It's also the lowest price option (by far) on the top 10 list, and is the only non-Microsoft and non-Oracle entry.
It can easily scale to thousands of users and terabytes of data.
Full disclosure: I work on the SQL Anywhere development team.
We need more details: What version you are using? What is the memory usage of the server? Are you vacuuming the database? Your performance problems might be un-related to PostgreSQL.
If you have many reads over writes, you may want to try MySQL assuming that the problem is with Postgres, but your problem is a write problem.
Still, you may want to look into your database design, and possibly consider sharding. For a really large database you may still have to look at the above 2 issues regardless.
You may also want to look at non-RDBMS database servers or document oriented like Mensia and CouchDB depending on the task at hand. No single tool will manage all tasks, so choose wisely.
Just out of curiosity, do you have any stored procedures that may be causing this delay?
LINQ simplifies database programming no doubt, but does it have a downside? Inline SQL requires one to communicate with the database in a certain way that opens the database to injections. Inline SQL must also be syntax-checked, have a plan built, and then executed, which takes precious cycles. Stored procedures have also been a rock-solid standard in great database application programming. Many programmers I know use a data layer that simplifies development, however, not to the extent LINQ does. Is it time to give up on the SP's and go LINQ?
LINQ to SQL actually presents some alarming performance problems in the database. Basically, it creates multiple execution plans based on the length of the parameter you are using. I posted about it a while back on my blog LINQ to SQL may cause performance problems.
Now, is that to say that LINQ doesn't have a place? Hardly. LINQ definitely has a place in the development toolkit, just like stored procedures. Ultimately, you want to use stored procedures when performance is absolutely necessary and use an ORM tool in any other situation.
As far as inline SQL goes, there are ways to execute inline SQL so that the plan is only built once and is never recompiled. Most ORMs should take care of this aspect of performance tuning as well and using these methods is usually the safest way to execute your SQL since it forces you to use parameterized queries.
Like most database solutions, the right answer depends on the problem you're trying to solve. If you favor development speed over database/application performance, then using LINQ or another DAL/ORM tool is the best way to go. If you favor performance over ease of development, then using stored procedures and pure datasets is going to be your best bet. LLBLGen even provides a LINQ to LLBLGen layer so you can use LINQ to query LLBLGen's objects and have LLBLGen actually handle building your queries and avoid some of the downfalls of LINQ.
Your basic premise is flawed..
Inline SQL requires one to communicate with the database in a certain way that opens the database to injections.
No it doesn't. Hard-coding user-inputted values into a SQL statement does, but you could do that with store procedures as well.
Parameterizing your queries guards against injection attacks, but inline SQL can be parameterizing just as easily as stored procedures.
Inline SQL must also be syntax-checked, have a plan built, and then executed.
All Sql (SPs and inline) must be syntax-checked and have a plan built on their first call. Thereafter, the exact text of the request & the execution plan are cached. If another request with the exact same text (not counting parameters) is received, the cached execution plan is used.
So, if you hard-code values into inline SQL, the text won't match, and it will have to re-parse the query. However, if you use parameters, the text of the query will match, and you will get a cache hit. In which case, it wouldn't matter if the query in inline SQL or a SP.
In other words, the only problem with inline SQL is that it easy to do something that slow & insecure. But making inline SQL fast & secure is no more work that using a SP.
Which brings us to LINQ, which always using parameters, even if you hard-code the values into the LINQ statement, making "fast & secure" inline SQL trivial.
LINQ also have the advantage over SPs of having all your code in one spot, instead of scattered over two different machines.
If you're interested in benchmarking, Rico Mariani has an excellent 5-part study that covers the qualitative and quantitative differences.
He may be an MS guy, but he's known as a performance nut - his benchmarks are thorough and well thought out.
This is a performance run by Maximilian Beller. According to him, LINQ is much much slower.
Read his comprehensive study
Just think about changing a columns name - now change the (n)SPs and (x)Views.
Do everything that is expensive on the database (like searches , sorting etc..) and you won't notice a problem.
Also, if you want to display a large grid without paging ... then use a dataset - that one is faster.
StackOverflow also uses linq2sql - do you see a problem :) ?
Use an ORM - it's the way to go on most applications.
PS: also, about micro benchmarks - like .. let's select 10.000 rows with an ORM - DON'T DO IT. That's not why you use an ORM. If you want to select 10.000 rows use ADO.
It depends on what you're doing. LINQ is going to be less efficient at the actual data/set manipulation than a real database. But you'll save a lot in not having to connect to the database over a network.
If your database is on the same machine or is formally 'well-connected', you're probably better off using it.
But if you're getting back a large result set from a remote db that could mean significant transmission time, or if it's a really short query that won't justify the overhead, LINQ would likely be better.
Because of the structure of LINQ to SQL, there is no possible way it can be faster than using raw SQL, either your own well-formed queries or as a stored procedure. What LINQ buys you is not speed but type safety and organization; in short most of the benefits that ORMs generally grant you.
LINQ to SQL is not about speed, it's about building a more maintainable software system. It's about all the stuff dedicated Software Engineers and Architects care about, stuff like loose coupling and layering
That's not to say that you can't build some really unmaintainable code with LINQ -- nobody is keeping you from shooting yourself in the foot but you -- but done properly, LINQ can help tremendously. I'm not saying LINQ is a silver bullet, however. It has a host of issues that make it difficult to use in many enterprise situations -- which is why MS offers Entity Framework (ADO.NET 3.0). Of course, even that's not perfect given the recent EF Vote of No Confidence.
Is LINQ to SQL or even EF better than raw SQL? I'd say a resounding Hells Yeah. Are there other solutions that might work better? Maybe.