Load and Performance Testing of a Database

This is the first time my team has asked me to do any testing on a database, and I have no clue how to approach it. By database testing I mean I need to see how fast it can insert records and how much load it can handle before it breaks down, essentially load and performance testing for the database. The database we are about to use is XPRESSmp.
Can anyone help me understand what kind of testing is usually done in this situation, and which tools I should look into? Most of the articles I have found relate to Oracle and MySQL, but this is a new database altogether.
One approach I can think of is to write a multithreaded program with X threads that pump data into XPRESSmp at very high speed while measuring how long each thread takes. What else can I do to test the database?
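Roughly what I have in mind (a minimal sketch only; sqlite3 stands in for the XPRESSmp driver since I don't know its Python connector, and the table/column names are made up):

```python
# Minimal multi-threaded insert harness. sqlite3 is only a stand-in so the
# sketch runs as-is; swap in the real XPRESSmp driver/connection for actual
# numbers. Note that SQLite serializes writers, so the figures it produces
# only illustrate the harness, not the target database.
import sqlite3
import threading
import time

DB_PATH = "load_test.db"
THREADS = 8
ROWS_PER_THREAD = 10_000

def worker(thread_id, results):
    # Each thread gets its own connection; most drivers are not thread-safe.
    conn = sqlite3.connect(DB_PATH, timeout=30)
    cur = conn.cursor()
    start = time.perf_counter()
    for i in range(ROWS_PER_THREAD):
        cur.execute("INSERT INTO load_test (thread_id, seq) VALUES (?, ?)",
                    (thread_id, i))
    conn.commit()
    results[thread_id] = time.perf_counter() - start
    conn.close()

def main():
    setup = sqlite3.connect(DB_PATH)
    setup.execute("CREATE TABLE IF NOT EXISTS load_test (thread_id INT, seq INT)")
    setup.close()

    results = {}
    threads = [threading.Thread(target=worker, args=(n, results))
               for n in range(THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    total_rows = THREADS * ROWS_PER_THREAD
    wall_time = max(results.values())   # slowest thread ~ overall wall time
    print(f"{total_rows} rows, ~{total_rows / wall_time:.0f} rows/s overall")
    for tid, secs in sorted(results.items()):
        print(f"thread {tid}: {ROWS_PER_THREAD / secs:.0f} rows/s")

if __name__ == "__main__":
    main()
```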
My team has also asked me to try to break the database with this testing, but we need to know under what conditions it broke and what the reason was.
What important points should I know and take into consideration while testing the database?
P.S. I will be doing this testing on separate LnP (load and performance) machines.

SysBench is commonly used to test query performance on MySQL, though it is not limited to MySQL. I have only basic knowledge of it myself, so I suggest reading the documentation rather than asking me:
http://sysbench.sourceforge.net/

You can use these tools:
HammerDB is an open source database load testing and benchmarking tool for Oracle, SQL Server, TimesTen, PostgreSQL, Greenplum, Postgres Plus Advanced Server, MySQL and Redis. HammerDB is automated, multi-threaded and extensible with dynamic scripting support. HammerDB includes complete built-in workloads based on industry standard benchmarks as well as capture and replay for the Oracle database.
To download it or read more, visit http://hammerora.sourceforge.net/
p-unit
Description:
An open source framework for unit testing and performance benchmarking, initiated by Andrew Zhang under the GPL license. p-unit supports running the same tests single-threaded or multi-threaded, tracks memory and time consumption, and generates results as plain text, images or PDF files.
http://p-unit.sourceforge.net/
DBMonster
Description:
DBMonster is an application to generate random data for testing SQL database driven applications under heavy load.
http://sourceforge.net/projects/dbmonster/

As answered elsewhere: use the k6 SQL extension.


Transforming (Synchronizing) Data from SQL to HBase

We are overhauling our product by moving completely from the Microsoft/.NET family to open source (partly for cost cutting, partly because of an exponential increase in data).
We plan to move our data model completely from SQL Server (relational data) to Hadoop (the famous key-value ecosystem).
In the beginning we want to support both versions (say v1.0 and the new v2.0). In order to maintain data consistency, we plan to sync the data between both systems, which is a fairly challenging and error-prone task, but we don't have any other option.
A bit confused about where to start, I am looking to the community of experts.
Any strategy/existing literature or any other kind of guidance in this direction would be greatly helpful.
I am not entirely sure how your code is structured, but if you currently have a data or persistence layer, or at least a database access class that all your SQL goes through, you could override the save functions to write changes to both databases (see the sketch after these options). If you do not have a data layer, you may want to consider writing one before starting the transition.
Otherwise, you could add triggers in MSSQL to update Hadoop, though I am not sure what you can do on the Hadoop side to keep MSSQL in sync.
Or you could have a process that runs every X minutes and manually syncs the two databases.
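A rough sketch of the first option, a repository that mirrors every save; both store classes are hypothetical placeholders for your existing data layer and whatever HBase client you end up using:

```python
# Dual-write sketch: the primary (relational) store stays the source of
# truth, and every change is mirrored to the secondary (HBase-like) store.
# Only this layer knows the second system exists.
import logging

class SqlStore:
    def save(self, entity: dict) -> None:
        # e.g. the INSERT/UPDATE your current data layer already performs
        ...

class HBaseStore:
    def save(self, entity: dict) -> None:
        # e.g. a Put against the corresponding row key / column family
        ...

class DualWriteRepository:
    def __init__(self, primary: SqlStore, secondary: HBaseStore):
        self.primary = primary
        self.secondary = secondary
        self.retry_queue = []           # failed mirror writes, replayed later

    def save(self, entity: dict) -> None:
        self.primary.save(entity)       # source of truth; let failures propagate
        try:
            self.secondary.save(entity)
        except Exception:
            # never let the mirror break the live system; retry asynchronously
            logging.exception("mirror write failed, queued for retry")
            self.retry_queue.append(entity)
```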
Personally, I would try to avoid maintaining two databases of record. Moving changes from a new, experimental database to your stable database seems risky; you stand the chance of corrupting your stable system. Instead, I would write a converter to move data from your relational DB to Hadoop. Then every night or so, copy your data into Hadoop and use it for the development and testing of your new system. I think test users would understand if you said your beta version is just a test playground and won't affect your live product. If you plan on making major changes to your UI and fear some users will not want to transition to 2.0, then you might be trying to tackle too much at once.
Those are the solutions I came up with... Good luck!
Consider using a queuing tool like Flume (http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/) to split your input between both systems.

Testing database performance tools?

What do you consider the best tool for testing database performance? I'm looking for a tool that will help me find the weak performance spots in my database while the application is in use.
There are at least two not-so-obvious tools that can help you:
SoapUI has support for JDBC
JMeter has a JDBC sampler (don't miss these wonderful plugins!)
I call these tools not-so-obvious because they are typically used for other purposes (SOAP web service functional testing and HTTP load testing, respectively). JMeter seems a bit better suited, as it is aimed at performance testing, but SoapUI can do this as well.
I'd just use SQL Server Profiler to capture a database-side trace, and then sort by duration.
I do stuff like this 5 times a day.
Hope that helps
-Aaron MCITP: DBA
You could use a source-code-level profiler to profile the application that accesses your database. Profilers can identify the slowest lines of code, and most can filter their results by namespace or naming patterns, so you could filter out all non-database access code. You can then look at which database queries are made on those slow lines.
In some database systems you can set up logging to record which queries were run and how long they took. Database monitoring applications can show you which queries are running at the moment, so you can identify the slowest or most frequently executed queries very easily. If that is not an option, you can log the queries your app issues to a text file and then run them manually against the database; the time taken is usually displayed.
A good feature for optimizing database queries is the EXPLAIN command supported by many DBMSs.
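For example, here is a quick illustration using SQLite's variant; the exact syntax and output differ per DBMS (EXPLAIN in MySQL/PostgreSQL, execution plans in SQL Server), but the idea is the same: ask the engine how it intends to run the query before you start optimizing.

```python
# EXPLAIN QUERY PLAN example against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?", (42,))
for row in plan:
    print(row)   # shows SEARCH ... USING INDEX vs. a full table SCAN
```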
If you told us exactly which database you are running, we could help more.

Testing custom ORM solution performance overhead - how to?

I have created a prototype of a custom ORM tool using aspect-oriented programming (PostSharp) and achieving persistence ignorance (before compile time). Now I am trying to find out how much overhead it introduces compared to using a plain DataReader and ADO.NET. I made a test case (insert, read, delete about 1000 records) against MS SQL Server 2008 and MySQL Community Edition, and ran it multiple times with pure ADO.NET and with my custom tool.
I expected the results to depend on many factors (memory, swapping, CPU, other processes), so I ran the tests many times (20-40). But the results were really unexpected: they simply differed too much between runs. If there were just a few extreme values I could ignore them (maybe swapping occurred or something like that), but they were so different that I am sure I cannot trust this kind of testing. Almost half the time my ORM showed 10% better performance than pure ADO.NET; the other times it was 10% worse.
Is there any way I can make these tests reliable? I do not have a powerful computer with lots of memory, but maybe I can somehow make MS SQL, MySQL or ADO.NET behave as consistently as possible during the tests? And what about the number of records: which is more reliable, using a small number of records and running more iterations, or the other way around?
Have you seen ORMBattle.NET? See the FAQ there; it has some ideas related to measuring the performance overhead introduced by a particular ORM tool. The test suite is open source.
Concerning your results:
Some ORM tools automatically batch statement sequences (i.e. send several SQL statements together). If this feature is implemented well in the ORM, it is easy to beat plain ADO.NET by 2-4x on CRUD operations when the ADO.NET test does not batch (see the sketch after these points). ORMBattle.NET tests both cases.
A lot depends on how you establish transaction boundaries there. Please refer to ORMBattle.NET FAQ for details.
CRUD tests aren't the best performance indicator at all. It is generally fairly easy to approach peak possible performance here, since the RDBMS must do much more work than the ORM in this case.
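As a rough illustration of the batching point, here is a comparison of row-at-a-time inserts versus a single batched call; SQLite stands in for the real database, and with a networked DBMS the gap is usually much larger because every unbatched statement pays a network round trip.

```python
# Unbatched vs. batched inserts of the same 10,000 rows.
import sqlite3
import time

rows = [(i, f"name-{i}") for i in range(10_000)]

def timed(label, fn):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INT, name TEXT)")
    start = time.perf_counter()
    fn(conn)
    conn.commit()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

timed("one statement per row", lambda c: [c.execute("INSERT INTO t VALUES (?, ?)", r) for r in rows])
timed("batched executemany", lambda c: c.executemany("INSERT INTO t VALUES (?, ?)", rows))
```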
P.S. I'm one of ORMBattle.NET authors, so if you're interested in details / possible contributions, you can contact me directly (or join ORMBattle.NET Google Groups).
I would run the test for a longer duration and with many more iterations, as small differences average out over time and you should get a clearer picture (see the sketch below). Also, make sure you eliminate anything external that may be affecting your test, such as other processes running, not enough free memory, cold start vs. warm start, network usage, etc.
Also, make sure that your database file and log file have enough free space allocated so you aren't waiting for the DB to grow the file during certain tests.
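Something along these lines, for example; run_ado and run_orm are placeholders for your two test cases, and the warm-up/median approach is just one reasonable way to tame the noise:

```python
# Warm up first, run many iterations, and compare medians rather than a
# single run or the mean, which is easily skewed by caching, GC and other
# background activity.
import statistics
import time

def benchmark(fn, warmup=5, iterations=30):
    for _ in range(warmup):            # let caches, JIT and connection pools settle
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), statistics.pstdev(samples)

# median_ado, spread_ado = benchmark(run_ado)
# median_orm, spread_orm = benchmark(run_orm)
```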
First of all you need to find out where the variance comes from: the ORM layer itself, or the database?
Many times the source of such variance is the database itself. Databases are very complex systems, with many active processes inside that can interfere with performance measurements. To achieve reproducible results you'll have to place your database under 'laboratory' conditions and make sure nothing unexpected happens. What that means differs from vendor to vendor, and you need to know some fairly advanced topics in order to tackle something like this. For instance, on a SQL Server database the typical sources of variation are:
cold cache vs. warm cache (both data and procedures)
log and database growth events
maintenance jobs
ghost cleanup
lazy writer
checkpoints
external memory pressure

Database connectivity in Delphi

I've been using Delphi for years, but never for database work; I only recently started researching and testing it.
I must say I'm impressed: most things happen automatically, whereas I'm used to writing everything by hand in PHP and Python.
I'm going to develop a commercial system for a friend: 2 layers, 5 user computers, 1 database server.
The database server will be a decent machine with 2 hard drives in RAID-1, running MySQL 5.1, PostgreSQL or Firebird (open to suggestions).
ADO
Easy to use
Easy deployment (only mysqlconnector installer)
The slowest?
DbExpress
Need to ship 4 files [dbxconnections.ini, dbxdrivers.ini, mysqldll, driverdll]
The most complex (harder to use)
ClientDataSet adds complexity, but looks really useful
No free PostgreSQL driver?
Zeos
Easy deployment (1 dll)
Easy to use
As you can see the desired features are:
fast
easy to use
easy to deploy
I can't test them all in a real scenario (clients plus server), so I hope that you guys with experience can help me decide which one to choose, and why.
EDIT: Thanks everyone, I think I will go with ADO (probably) or Zeos.
Thanks in advance
Arthur
@arthurprs, for your scenario
(2 layers) 5 user computers, 1 database server
the Firebird RDBMS is a very good option, because it is very stable, fast, runs on Linux, Windows and a variety of Unix platforms, and meets your requirements.
As for the connection components, I would go with Zeos.
I have used this combination in many small and medium projects, with excellent results.
I have worked on many commercial high-volume systems using ADO without any problems. Deployment is relatively simple since it's included in the OS. Since it has such a wide audience, most of the major issues have been identified and corrected. Getting help with ADO connectivity is very easy. The database support is very broad (connectionstrings.com), which makes supporting additional database engines almost trivial (you may still need to install the client drivers, but that would be the same for almost any solution).
Performance isn't much of an issue, it really will come down to database architecture and engine selection.
I'd have to say I'm rather happy with NexusDB, but the cost of the client/server versions might not be worth it.
It works client/server or fully embedded, and it's simple enough that you can have both in your app and switch between them depending on your client's needs.
The embedded DB is free.
Client/server is priced per developer at AU$500.
No cost per install.
Oh yeah, and it's written in Delphi ;)
I'd say go with Firebird: it is the most used database engine in Delphi land (see here). For connectivity it is perhaps better to go with Zeos (free) or DBX (if you can afford the Architect edition, the only one that includes the Firebird driver).
About ADO: it is a mature connectivity layer, but it will (most probably forever) be tied to Windows while Delphi goes cross-platform. Also, yes, it tends to be the slower option for many reasons, including the ODBC drivers used in certain situations. But in your case, of course, as skamradt says, I don't think it will matter much.
Although I have read people not liking the idea of mixing the two, I have had good results using ADO Datasets as a "provider layer" which then feeds the data into TClientDataSets - so there's no reason you can't use ClientDataSets if you go down the ADO route if you find you need them (and they are useful).
Otherwise, I would echo the comment that ADO is a tried, trusted mechanism that isn't going anywhere. I've always found it more than fast enough. And configuration using UDL files is nice and easy.
dbGo (ADO) is simpler to manage, more universal, slower.
dbExpress is faster, more complex to manage, supports fewer DBMSs.
ZeosDBO is simple to manage, as universal as dbExpress, as slow as dbGo, cross-platform, has a few additional components, and all sources are accessible.
There are a few other libraries that resolve all of the above concerns, although they are all commercial products. But there I am biased :)
We have used postgreSQL using Devart pg components with great success in medium sized database apps.
We did some limited benchmarking with this combination and found it 2-3x the speed of using ADO etc.
-- Data access components
I too favour the combination of TClientDataset and ADO. I have worked with it in the past and I can say it's trustworthy. The flexibility of TClientDataset is a big gain. DBExpress is good too.
Actually, I use client datasets with pretty much any data access layer that has a TDataset descendant...
-- Server
Firebird. Free and easily usable from OLE DB (I used it with ODBC) and DBExpress (D2010+ has a native DBX driver). I don't know Zeos, but I believe it also connects to FB.
It scales well to many connections and big databases; Firebird databases of 500 GB with many users have been reported.

ETL Tools and Build Tools

I am familiar with automated software build tools (such as Automated Build Studio). Now I am looking at ETL tools.
The thing that crosses my mind is that I can do anything an ETL tool does by using a software build tool. ETL tools are tailored to data loading and manipulation, for which a lot of scripting is needed. A software build tool, on the other hand, is versatile enough to do any job, including running scripts that extract, transform and load data from any format into any other format.
Am I right?
It is correct that you can roll out your own ETL scripts written with a development tool of your preference. Having said that, ETL jobs are frequently large (for lack of a better word) and demand considerable administration and attention to minute detail (like programming). ETL tools allow the developer to focus on ETL tasks, as opposed to writing and debugging code, although that's part of it too. There are some open-source tools out there, so you can get a feel for what an average tool does before jumping into custom development. For example, more expensive tools provide data lineage, meaning you can (graphically) track every field on a report back to the originating table through all transformations (versions included); after a corporate merger that's quite a task to do.
For example, Pentaho has a community edition; if you have MS SQL Server, you can get SSIS. Also see if you can find something here.
The benefit of an ETL tool is maximized if you have many processes to build (I like jsf80238's analogy about hammering in 100 nails). A key benefit of real ETL tools is the metadata they generate and the operational support. Writing your scripts in Perl/Ruby/etc. is fairly easy, but breaks down when problems need to be tracked down or someone other than the author has to figure out what's wrong. The ability of admin/support staff to quickly see what went wrong is what's worth paying money for. I have used Microsoft's SSIS (2005 - OK) and the latest Pentaho PDI (quite good). The Pentaho ETL GUI is used by business users (without IT support 99% of the time) at my workplace, and has replaced a tangle of SQL scripts and spreadsheets. Say what you like about the rest of the Pentaho stack, but the ETL component is, in my opinion, excellent bang for the buck.
The whole business of ETL is based on the premise that the source of the data is incompatible with the destination data source. And many times, the folks who dump the source data may not be thinking that the data needs to be collected and aggregated. This is why the whole business of ETL exists.
A commercial ETL tool will not magically read the source input and transform data according to the rules of the destination database. Rules have to be defined and fed into the ETL tool. Interestingly, many companies offer training!!! on how to use their proprietary scripting language. So it is not always that easy. But for non-programmers, maybe this is the preferred route.
Personally, I think that it is always easier to write a proprietary ETL tool in a language like Perl. Simply write a state-machine algorithm to rip through the source data and convert it to the desired format. I use Perl to FTP into machines, read in the files, transform the data, and then load it into the database. This is always a superior solution and much faster if one is proficient in Perl or similar, or can hire someone who knows Perl.
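As a sketch of that hand-rolled approach, here is the same pattern in Python rather than Perl; the file name, delimiter and schema are made up for the example, and SQLite stands in for the target database:

```python
# Tiny extract-transform-load script: read a delimited dump, reshape each
# record, and bulk-load it into a staging table.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f, delimiter="|")

def transform(record):
    # per-record cleanup / reshaping rules live here
    return (record["customer_id"].strip(),
            record["order_date"][:10],          # normalize to YYYY-MM-DD
            float(record["amount"] or 0.0))

def load(rows, db_path="staging.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS staging_orders "
                 "(customer_id TEXT, order_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(r) for r in extract("orders_dump.psv"))
```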
And one final point: start with the end in mind. Dump your source data in a structured format to help out the analysis group in your company that wants to aggregate and study the data. This will make the ETL program easier and faster to develop.
I like Damir Sudarevic's answer and wanted to add that your choice of tool might also depend on how much work you have in front of you. If you have the occasional ETL task and are already familiar with a tool that will allow you to accomplish that task, use the tool you already know (this approach assigns a zero value to learning a new tool, which is perhaps undervaluing new knowledge). If you have a lot of ETL tasks, the up-front investment of learning a new tool might very well pay off. You can use pliers to drive a nail, and if you have only one nail you can use the pliers. If you have to drive 100 nails get yourself a hammer.
You can also do anything ETL tools can do with code. :-)
Both tool categories you mention can be used to solve this problem, but they are optimized for the class of problems they are trying to solve:
ETLs tend to come with a library of data manipulation tools (relational calculus, in-line computations, etc.), are optimized to handle large quantities of data, and have job management features (important if this isn't a single one-off data migration).
Build tools (for me, Ant comes to mind as a prototypical example) could do similar tasks, but are focused on compilation, file organization and manipulation, and packaging.
