Are stored procedures required for large data sets? - sql-server

I've just started my first development job for a reasonably sized company that has to manage a lot of data. An average database is 6 GB (from what I've seen so far). One of the jobs is reporting. How it's done currently is -
Data is replicated and transferred onto a data warehouse. From there, all the data required for a particular report is gathered (thousands of rows and lots of tables) and aggregated to a reports database in the warehouse. This is all done with stored procedures.
When a report is requested, a stored procedure is invoked which copies the data onto a reports database which PHP reads from to display the data.
I'm not a big fan of stored procs at all. But the people I've spoken to insist that stored procedures are the only option, as queries directly against the data via a programming language are incredibly slow (think 30 mins?). Security is also a concern.
So my question is - are stored procedures required when you have a very large data set? Do queries really take that long on such a large amount of data or is there a problem with either the DB servers or how the data is arranged (and indexed?). I've got a feeling that something is wrong.

The reasoning behind using a stored procedure is that the execution plan that is created in order to execute your procedure is cached by SQL Server in an area of memory known as the Plan Cache. When the procedure is then subsequently re-run at a later time, the execution plan has the possibility of being re-used.
A stored procedure will not run any faster than the same query executed as a batch of T-SQL. It is the re-use of the execution plan that results in a performance improvement; the query cost of the actual T-SQL is the same.
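You can see this for yourself on SQL Server by querying the plan cache DMVs. A minimal sketch (the procedure name is a placeholder for one of your own):

-- Check whether a plan is cached and how often it has been re-used.
SELECT cp.objtype, cp.usecounts, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%usp_YourReportProc%';   -- placeholder name

A usecounts value that keeps climbing confirms the plan is being re-used rather than recompiled, whether the statement came from a proc or an ad-hoc batch.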
Offloading data to a reporting database is a typical practice; however, you may need to review your indexing strategy on the reporting database, as it will likely need to be quite different from that of your OLTP platform.
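As a rough illustration of how the two can differ (table and column names here are invented), a reporting database often carries wide covering indexes that would be too expensive to maintain on a busy OLTP system:

-- Hypothetical covering index for a common report query: the report
-- can be answered entirely from the index, never touching the base table.
CREATE NONCLUSTERED INDEX IX_Sales_Region_Date_Report
ON dbo.Sales (RegionID, SaleDate)
INCLUDE (ProductID, Quantity, Amount);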
You may also wish to consider using SQL Server Analysis Services in order to service your reporting requirements, as it sounds like your reports contain lots of data aggregations. Storing and processing data for the purpose of fast counts and analytics is exactly what SSAS is all about. It sounds like it is time for your business to look at building a data warehouse.
I hope this helps but please feel free to request further details.
Cheers, John

In the context in which you are operating - large corporate database accessed in several places - it is virtually always best to place as much business logic inside the database as is possible.
In this case your immediate performance benefits are:
1) If the SP involves any processing beyond a simple select, processing the data within the database can be orders of magnitude faster than sending rows across the network to your program for handling there.
2) You do acquire some benefit from the fact that the SP is stored compiled. This is usually marginal compared to 1) when processing large volumes.
However, what is often more important than performance in my mind is that with corporate databases, encapsulating the logic inside the database itself provides major management and maintenance benefits:
Data structures can be abstracted away from program logic, allowing database structures to change without requiring changes to programs accessing the data. Anyone who has spent hours grep'ing a corporate codebase for SQL using [mytable] before making a simple database change will appreciate this.
SPs can provide a security layer, although this can be overused and overrelied on.
You say this is your first job for a company with a database of this type, so you can be forgiven for not appreciating how a database-centric approach to handling the data is really essential in such environments. You are not alone either - in a recent podcast Jeff Atwood said he wasn't a fan of putting code into databases. This is a fine and valid opinion when you are dealing with a database serving a single application, but it is 100% wrong for a database used across a company by several applications, where the best policy is to screw down the data with a full complement of constraints and use SPs liberally for access and update.
The reason for this is that if you don't, such databases always lose data integrity and accumulate crud. Sometimes it's virtually impossible to imagine how they do, but in any large corporate database (tens of millions of records) without sufficient constraints there will be badly formed records - at best these force a periodic clean-up of the data (a task I regularly used to get dumped with as a junior programmer); at worst they cause applications to crash due to invalid inputs, or, worse still, don't cause a crash but deliver incorrect business information to the end users. And if your end user is your finance director, then that's your job on the line :-)
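A small sketch of what "screwing down the data" looks like in practice (the table and rules are made up for illustration):

-- Hypothetical constraints: the database rejects bad rows no matter
-- which application, script or spreadsheet tries to insert them.
ALTER TABLE dbo.Invoices
    ADD CONSTRAINT FK_Invoices_Customers
        FOREIGN KEY (CustomerID) REFERENCES dbo.Customers (CustomerID);
ALTER TABLE dbo.Invoices
    ADD CONSTRAINT CK_Invoices_Amount CHECK (Amount >= 0);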

It seems to me that there is an additional step in there that, based on your description, appears unnecessary. Here is what I am referring to -
"When a report is requested, a stored procedure is invoked which gathers the data into a format required for a report, and forwarded to another stored procedure which transforms the data into a view, and forwards THAT off to a PHP framework for display."
A sproc transforms the data for a report, then another sproc transforms this data into another format for front-end presentation - is the data ever used in the format it is in after the first sproc? If not, that stage seems unnecessary to me.
I'm assuming that your reports database is a data warehouse and that data is ETL'ed into it and stored in a format suited to reporting. Where I currently work, this is common practice.
As for your question regarding stored procedures, they allow you to centralize logic within the database and "encapsulate" security, the first of which would appear to be of benefit within your organisation, given the other sprocs that you have for data transformation. Stored procedures also have a stored execution plan which, under some circumstances, can provide some improvement to performance.

I found that stored procedures help with large data sets because they eliminate a ton of network traffic, which can be a huge performance bottleneck depending on how large the data set actually is.

When processing large numbers of rows, where indexes are available and the SQL is relatively well tuned, the database engine performing set-based operations directly on the data - through SQL, say - will almost always outperform row-by-row processing (even on the same server) in a client tool. The data does not cross any physical or logical boundaries to leave the database server processes, or to leave the database server and go out across the network. Even performing RBAR (row by agonizing row) on the server will be faster than performing it in a client tool, if only a limited amount of data really needs to ever leave the server, because...
When you start to pull more data across networks, then the process will slow down and limiting the number of rows at each stage becomes the next optimization.
All of this really has nothing to do with stored procedures. Stored procedures (in SQL Server) no longer provide much of a performance advantage over batch SQL. Stored procedures do provide a large number of other benefits like modularization, encapsulation, security management, design by contract, and version management. Performance, however, is no longer an advantage.
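To make the set-based vs. RBAR point concrete, here is a small sketch (table and column names are hypothetical): the first statement does all the work inside the engine in one pass, while the cursor version touches the same rows one at a time and is typically far slower even when run entirely on the server.

-- Set-based: one statement, one plan.
UPDATE dbo.Orders
SET    Status = 'Archived'
WHERE  OrderDate < DATEADD(YEAR, -2, GETDATE());

-- RBAR equivalent of the same work.
DECLARE @OrderID int;
DECLARE order_cursor CURSOR FOR
    SELECT OrderID FROM dbo.Orders
    WHERE OrderDate < DATEADD(YEAR, -2, GETDATE());
OPEN order_cursor;
FETCH NEXT FROM order_cursor INTO @OrderID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders SET Status = 'Archived' WHERE OrderID = @OrderID;
    FETCH NEXT FROM order_cursor INTO @OrderID;
END
CLOSE order_cursor;
DEALLOCATE order_cursor;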

Generally speaking, stored procedures have a number of advantages over direct queries. I can't comment on your complete end-to-end process; however, SPs will probably perform faster. For a start, an ad-hoc query may need to be compiled and an execution plan worked out each time it is run, whereas a stored procedure's plan is more reliably reused.
There are other reasons why you would want to use stored procedures - centralisation of logic, security, etc.

The end to end process does look a little complicated but there may be good reasons for it simply due to the data volume - it might well be that if you run the reports on the main database, the queries are slowing down the rest of the system so much that you'll cause problems for the rest of the users.
Regarding the stored procedures, their main advantage in a scenario like this is that they are pre-compiled and the database has already worked out what it considers to be the optimal query plan. Especially with the data volumes you are talking about, this might well result in a very noticeable performance improvement.
And yes, depending on the complexity of the report, a query like this can take half an hour or longer...

This reporting solution seems to have been designed by people that think the database is the centre of the world. This is a common and valid view – however I don’t always hold to it.
When moving data between tables/databases, it can be a lot quicker to use stored procs, as the data does not need to travel between the database and the application. However, in most cases I would rather not use stored procs, as they make development more complex; I am in the ORM camp myself. You can sometimes get great speedups by loading lots of data into RAM and processing it there, but that is a totally different way of coding and will not allow the reuse of the logic that is already in the stored procs. Sorry, I think you are stuck with stored procs while in that job.
Given the amount of data being moved about, if using SQL Server I would look at using SSIS or DTS - Oracle will have something along the same lines. SSIS will do the data transformations on many threads while taking care of a lot of the details for you.
Remember, the design of software has more to do with the history of the software and the people working on it than with the “right way of doing it”. Come back in 100 years and we may know how to write software; at present it is mostly a case of the blind leading the blind. Just like when the first bridges were built and a lot of them fell down, no one could tell you in advance which bridge would keep standing and why.

Unlike autogenerated code from an ORM product, stored procs can be performance tuned. This is critical in a large production environment. There are many ways to tweak performance that are not available when using an ORM. Also, there are many tasks performed by a large database which have nothing to do with the user interface and thus should not be run from code produced there.
Stored procs are also required if you want to control rights so that users can only perform the operations specified in the proc and nothing else. Otherwise, users can much more easily make unauthorized changes to the databases and commit fraud. This is one reason why database people who work with large, business-critical systems do not allow any access except through stored procs.
If you are moving large amounts of data to other servers though, I would consider using DTS (if using SQL Server 2000) or SSIS. This may speed up your processes still further, but it will depend greatly on what you are doing and how.
The fact that SPs may be faster in this case doesn't rule out that the indexing may be wrong or the statistics out of date, but generally DBAs who manage large sets of data tend to be pretty on top of this stuff.
It is true the process you describe seems a bit convoluted, but without seeing the structure of what is happening and understanding the database and environment, I can't say if maybe this is the best process.
I can tell you that new employees who come in and want to change working stuff to fit their own personal prejudices tend to be taken less than seriously, and then you will have little credibility when you do need to suggest a valid change. This is particularly true when your past experience is not with databases of the same size or type of processing. If you were an expert in large systems, you might be taken more seriously from the start, but, face it, you are not, and thus your opinion is not likely to sway anybody until you have been there awhile and they have a measure of your real capabilities. Plus, if you learn the system as it is and work with it as it is, you will be in a better position in six months or so to suggest improvements rather than changes.

I could perhaps come up with more, but a few points.
Assuming a modern DB, stored procedures probably won't actually be noticeably faster than plain queries, due to plan caching and the like.
The security benefits of Stored procedures are somewhat overrated.
Change is evil. Consistency is king.
I'd say #3 trumps all other concerns unless stored procedures are causing a legitimate problem.

The fastest way to report is to just read all the data into memory (a 64-bit OS is required) and walk the objects. This is of course limited by RAM size (32 GB is affordable) and to reports where you hit a large part of the DB. There is no need to make the effort for small reports.
In the old days I could run a report querying over 8 million objects in 1.5 seconds. That was in about a gigabyte of RAM on a 3 GHz Pentium 4. 64-bit should be about twice as slow, but that is compensated for by faster processors.

Related

Strategy for generating statistics with user data

I need a general piece of advice, but for the record I use JPA.
I need to generate usage data statistics, e.g. a breakdown of user purchases per product, etc. I see three possible strategies: 1) generate stats on the fly each time the stats are viewed, 2) create a specific stats table that I would update each time there is a change, 3) do offline processing at regular time intervals.
All have issues and advantages, e.g. cost vs. not-up-to-date data, and I was wondering if anyone with experience in this field could provide some advice. I am aware the question is pretty broad; I can refine my use case if needed.
I've done a lot of reporting and the first question I always want to know is if the stakeholder needs the data in real time or not. This definitely shifts how you think and how you'll design a reporting system.
Based on the size of your data, I think it's possible to do real time reporting. If you had data in the millions, then maybe you'd need to do some pre-processing or data warehousing (your options 2/3).
Some general recommendations:
If you want to do real time reporting, think about making a copy of the database so you aren't running reports against production data. Some reports can use queries that are heavy, so it's worth looking into replicating production data to some other server where you can run reports.
Use intermediate structures a lot for reports. Write views, stored procedures, etc. so every report isn't just some huge complex query (a small sketch follows after this list).
If the reports start to get too complex for doing at the database level, make sure you move the report logic into the application layer. I've been bitten by this many times. I start writing a report with queries purely from the database and eventually it gets too complex and I have to jump through hoops to make it work.
Shoot for real time and then go to stale data if necessary. Databases are capable of doing a lot more than you'd think. Quite often you can make changes to your database structures that will give you a big yield in performance.
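As a small sketch of such an intermediate structure (the Purchases/Products tables and columns are assumptions, matching the purchases-per-product example from the question):

-- A view that gives every report one simple thing to query.
CREATE VIEW dbo.vw_PurchasesPerProduct
AS
SELECT  p.ProductID,
        p.ProductName,
        COUNT(*)       AS PurchaseCount,
        SUM(pu.Amount) AS TotalAmount
FROM    dbo.Purchases AS pu
JOIN    dbo.Products  AS p ON p.ProductID = pu.ProductID
GROUP BY p.ProductID, p.ProductName;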

Should Databases be used just for persistence

A lot of web applications with a 3-tier architecture do all the processing in the app server and use the database just for persistence, in order to have database independence. After paying a huge amount for a database, doing all the processing (including batch) at the app server and not using the power of the database seems to be a waste. I have difficulty convincing people that we need to use the best of both worlds.
What "power" of the database are you not using in a 3-tier archiecture? Presumably we exploit SQL to the full, and all the data management, paging, caching, indexing, query optimisation and locking capabilities.
I'd guess that the argument is where what we might call "business logic" should be implemented. In the app server or in database stored procedure.
I see two reasons for putting it in the app server:
1). Scalability. It's comparatively hard to add more database engines if the DB gets too busy. Partitioning data across multiple databases is really tricky. So instead, pull the business logic out to the app server tier. Now we can have many app server instances all doing business logic.
2). Maintainability. In principle, stored procedure code can be well-written, modularised and reusable. In practice it seems much easier to write maintainable code in an OO language such as C# or Java. For some reason, re-use in stored procedures seems to happen by cut and paste, and so over time the business logic becomes hard to maintain. I would concede that with discipline this need not happen, but discipline seems to be in short supply right now.
We do need to be careful to truly exploit the database query capabilities to the full, for example avoiding pulling large amounts of data across to the app server tier.
It depends on your application. You should set things up so your database does things databases are good for. An eight-table join across tens of millions of records is not something you're going to want to handle in your application tier. Nor is performing aggregate operations on millions of rows to emit little pieces of summary information.
On the other hand, if you're just doing a lot of CRUD, you're not losing much by treating that large expensive database as a dumb repository. But simple data models that lend themselves to application-focused "processing" sometimes end up leading you down the road to creeping unforeseen inefficiencies. Design knots. You find yourself processing recordsets in the application tier. Looking things up in ways that begin to approximate SQL joins. Eventually you painfully refactor these things back to the database tier where they run orders of magnitude more efficiently...
So, it depends.
No. They should be used for business rules enforcement as well.
Alas the DBMS big dogs are either not competent enough or not willing to support this, making this ideal impossible, and keeping their customers hostage to their major cash cows.
I've seen one application designed (by a pretty smart guy) with tables of the form:
id | one or two other indexed columns | big_chunk_of_serialised_data
Access to that in the application is easy: there are methods that will load one (or a set) of objects, deserialising it as necessary. And there are methods that will serialise an object into the database.
But as expected (though only in hindsight, sadly), there are so many cases where we want to query the DB in some way outside that application! This is worked around in various ways: an ad-hoc query interface in the app (which adds several layers of indirection to getting at the data); reuse of some parts of the app code; hand-written deserialisation code (sometimes in other languages); and simply having to do without any fields that are inside the serialised chunk.
I can readily imagine the same thing occurring for almost any app: it's just handy to be able to access your data. Consequently I think I'd be pretty averse to storing serialised data in a real DB -- with possible exceptions where the saving outweighs the increase in complexity (an example being storing an array of 32-bit ints).

What's the benefit of using a lot of complex stored procedures

For a typical 3-tiered application, I have seen that in many cases a lot of complex stored procedures are used in the database. I cannot quite see the benefit of this approach. In my personal understanding, this approach has the following disadvantages:
Transactions become coarse.
Business logic goes into database.
Lots of computation is done in the database server, rather than in the application server. Meanwhile, the database still needs to do its original work: maintain data. The database server may become a bottleneck.
I can guess there may be 2 benefits of it:
Change the business logic without recompiling. But SPs are much harder to maintain and test than Java/C# code.
Reduce the number of DB connections. However, in the common case the database bottleneck is hard disk IO rather than network IO.
Could anyone please tell me the benefits of using a lot of stored procedures rather than letting the work be done in business logic layer?
Basically, the benefit is #2 of your problem list - if you do a lot of processing in your database backend, then it's handled there and doesn't depend on the application accessing the database.
Sure - if your application does all the right things in its business logic layer, things will be fine. But as soon as a second and a third application need to connect to your database, suddenly they too have to make sure to respect all the business rules etc. - or they might not.
Putting your business rules and business logic in the database ensures that no matter how an app, a script, a manager with Excel accesses your database, your business rules will be enforced and your data integrity will be protected.
That's the main reason to have stored procs instead of code-based BLL.
Also, by using views for reads and stored procs for updates/inserts, the DBA can remove any direct permissions on the underlying tables. Your users no longer need rights on the tables themselves, and thus the data in your tables is better protected from inadvertent or malicious changes.
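A rough sketch of that arrangement (all object and role names here are invented): direct table permissions are denied, and the application role only gets the view for reads and the proc for writes.

CREATE ROLE app_users;
GO
CREATE VIEW dbo.vw_Customers AS
    SELECT CustomerID, Name, City FROM dbo.Customers;   -- read path
GO
CREATE PROCEDURE dbo.usp_UpdateCustomerCity
    @CustomerID int, @City nvarchar(100)
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE dbo.Customers SET City = @City
    WHERE  CustomerID = @CustomerID;                     -- write path
END
GO
DENY  SELECT, INSERT, UPDATE, DELETE ON dbo.Customers TO app_users;  -- no direct table access
GRANT SELECT  ON dbo.vw_Customers            TO app_users;
GRANT EXECUTE ON dbo.usp_UpdateCustomerCity  TO app_users;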
Using a stored proc approach also gives you the ability to monitor and audit database access through the stored procs - no one will be able to claim they didn't alter that data - you can easily prove it.
So all in all: the more business critical your data, the more protection layer you want to build around it. That's what using stored procs is for - and they don't need to be complex, either - and most of those stored procs can be generated based on table structure using code generation, so it's not a big typing effort, either.
Don't fear the DB.
Let's also not confuse business logic with data logic which has its rightful place at the DB.
Good systems designers will encompass flexible business logic through data logic, i.e. abstract business-rule definitions which can be driven by the (non-)existence of data rows or by their attributes.
Just FYI, the most successful and scalable "enterprise/commercial" software implementations with which I have worked put all projection queries into views and all data management either into DB procedures or triggers on staged tables.
The network between the app server and the SQL server is very often the bottleneck.
Stored procedures are needed when you need to do complex queries.
For example, you want to collect some data about an employee by surname. Imagine that the data in the DB looks like a kind of tree - you have 3 records about this employee in table A, 10 records in table B for each record in table A, and 100 records in table C for each record in table B, and you want to get only a specific 5 records from table C about that employee. Without stored procedures you will generate a lot of query traffic between the app server and the SQL server, and a lot of code in the app server. With a stored procedure which accepts the employee surname, fetches those 5 records and returns them to the app server, you 1) decrease the traffic by hundreds of times, and 2) greatly simplify the app server code.
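A minimal sketch of that stored procedure (tables A, B, C and their columns are hypothetical, following the description above):

CREATE PROCEDURE dbo.GetEmployeeRecords
    @Surname nvarchar(100)
AS
BEGIN
    SET NOCOUNT ON;
    -- One round trip: the joins and filtering stay on the SQL Server.
    SELECT TOP (5) c.*
    FROM   dbo.A AS a
    JOIN   dbo.B AS b ON b.A_ID = a.ID
    JOIN   dbo.C AS c ON c.B_ID = b.ID
    WHERE  a.Surname = @Surname
    ORDER BY c.ID;   -- hypothetical rule for the "special 5 records"
END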
The life time of our data exceeds that of our applications. Also data gets shared between applications. So many applications will insert data into the database, many applications will retrieve data from it. The database is responsible for the completeness, integrity and correctness of the data. Therefore it needs to have the authority to enforce the business rules relating to the data.
Taking your specific points:
Transactions are Units Of Work. I fail to see why implementing transactions in stored procedures should change their granularity.
Business logic which applies to the data belongs with the data: that maximises cohesion.
It is hard to write good SQL and to learn to think in sets. Therefore it may appear that the database is the bottleneck. In fact, if we are undertaking lots of work which relates to the data, the database is probably the most efficient place to do it.
As for maintenance: if we are familiar with PL/SQL, T-SQL, etc maintenance is easier than it might appear from the outside. But I concede that tool support for things like refactoring lags behind that of other languages.
You listed one of the main ones: putting business logic in the DB often gives the impression of making it easier to maintain.
Generally, complex SP logic in the DB allows for a cheaper implementation of the application code, which may be beneficial if it's a transitional application (say, being ported from legacy code), if it's code which needs to be implemented in several languages (for instance, to market on different platforms or devices), or because the problem is simpler to solve in the DB.
One other reason for this is often there is a general "best practice" to encapsulate all access to the db in sps for security or performance reasons. Depending on your platform and what you are doing with it this may or may not be marginally true.
I don't think there are any. You are very correct that moving the BL to the database is bad, but not for everything. Try taking a look at Domain Driven Design. This is the antidote to massive numbers of SPROCs. I think you should be using your database as somewhere to store your business objects, nothing more.
However, SPROCs can be much more efficient for certain simple operations. For instance, you might want to increase the salary of every employee in your database by a fixed percentage. This is quicker to do via a SPROC than getting all the employees from the DB, updating them and then saving them back.
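A minimal sketch of that kind of SPROC (the Employees table and column names are assumptions):

CREATE PROCEDURE dbo.RaiseAllSalaries
    @Percent decimal(5, 2)
AS
BEGIN
    SET NOCOUNT ON;
    -- One set-based statement instead of loading, updating and saving
    -- every employee object through the application.
    UPDATE dbo.Employees
    SET    Salary = Salary * (1 + @Percent / 100.0);
END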
I worked on a project where everything was literally done at the database level. We wrote a lot of stored procedures and did a lot of business validation/logic in the database. Most of the time it became a big overhead for us to debug.
The advantages, as I felt them, were:
Take advantage of full DB features.
Database-intensive activities like lots of inserts/updates are better done at the DB level. Call an SP and let it do all the work instead of hitting the DB several times.
New DB servers can accommodate complex operations so they no longer see this as a bottleneck. Oh yeah, we used Oracle.
Looking at it now, I think a few things could have been done better at the application level and less at the DB level.
It depends almost entirely on the context.
Doing work on the server rather than on the clients is generally a bad idea, as it makes your server less scalable. However, you have to balance this against the expected workload (if you know you will only have 100 users in a closed environment, you may not need a scalable server) and network traffic costs (if you have to read a lot of data to apply calculations/processes to, then it can be cheaper/faster overall to run those calculations on the server and only send the results over the net).
Also, if you have custom client applications (as opposed to web browsers etc) it makes it very easy to push updates out to your clients, because you don't need to recompile and deploy the client code, you simply upgrade the database stored procedures.
Of course, using stored procedures rather than executing dynamically compiled SQL statements can be more efficient (the plan is precompiled, and the SQL text doesn't need to be sent to the server each time) and aids encapsulation, giving the database better integrity/security. But by the sound of it, you're talking about masses of business logic, not simple efficiency and security measures.
As with most things, a sensible compromise/balance is needed. Stored Procedures should be used enough to enhance efficiency and security, but you don't want your server to become unscalable.
"there are following disadvantages on this approach:
...
Business logic goes into database."
Insofar as by "business logic" you mean "enforcement of business rules", the DBMS is EXACTLY where "business logic" belongs.

Testing custom ORM solution performance overhead - how to?

I have created a prototype of a custom ORM tool using aspect-oriented programming (PostSharp) and achieving persistence ignorance (before compile time). Now I've tried to find out how much overhead it introduces compared to using a plain DataReader and ADO.NET. I made a test case - insert, read, delete data (about 1000 records) in MS SQL Server 2008 and MySQL Community Edition. I ran this test multiple times using pure ADO.NET and my custom tool.
I expected that the results would depend on many factors - memory, swapping, CPU, other processes - so I ran the tests many times (20-40). But the results were really unexpected. They just differed too much between runs. If there were just some extreme values, I could ignore them (maybe swapping occurred or something like that), but they were so different that I am sure I cannot trust this kind of testing. Almost half of the time my ORM showed 10% better performance than pure ADO.NET; the other times it was 10% worse.
Is there any way I can make those tests reliable? I do not have a powerful computer with lots of memory, but maybe I can somehow make MS SQL and MySQL or ADO.NET behave as consistently as possible during those tests? And what about the number of records - which is more reliable: using a small number of records and running the test more times, or the other way around?
Have you seen ORMBattle.NET? See the FAQ there; it has some ideas related to measuring the performance overhead introduced by a particular ORM tool. The test suite is open source.
Concerning your results:
Some ORM tools automatically batch statement sequences (i.e. send several SQL statements together). If this feature is implemented well in an ORM, it's easy to beat plain ADO.NET by 2-4 times on CRUD operations, if the ADO.NET test does not involve batching. The tests on ORMBattle.NET cover both cases.
A lot depends on how you establish transaction boundaries there. Please refer to ORMBattle.NET FAQ for details.
CRUD tests aren't the best performance indicator at all. In general, it's pretty easy to get close to the peak possible performance here, since the RDBMS must do much more work than the ORM in this case.
P.S. I'm one of ORMBattle.NET authors, so if you're interested in details / possible contributions, you can contact me directly (or join ORMBattle.NET Google Groups).
I would run the test for a longer duration and with many more iterations, as small differences will average out over time and you should get a clearer picture. Also, make sure you eliminate any external factors that may be affecting your test, such as other processes running, not enough free memory, cold start vs. warm start, network usage, etc.
Also, make sure that your database file and log file have enough free space allocated so you aren't waiting for the DB to grow the file during certain tests.
First of all you need to find out where the variance comes from: the ORM layer itself or the database?
Many times the source of such variance is the database itself. Databases are very complex systems, with many active processes inside that can affect the results of performance measurements. To achieve reproducible results you'll have to place your database under 'laboratory' conditions and make sure nothing unexpected happens. What that means varies from vendor to vendor, and you need to know some pretty advanced topics in order to tackle something like this. For instance, on a SQL Server database the typical sources of variation are the following (a sketch for resetting the caches between runs follows this list):
cold cache vs. warm cache (both data and procedures)
log and database growth events
maintenance jobs
ghost cleanup
lazy writer
checkpoints
external memory pressure
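For example, on SQL Server you can force a known cold-cache starting point before each measured run (a sketch; run this only on a test box, never in production):

CHECKPOINT;              -- flush dirty pages so the next command is effective
DBCC DROPCLEANBUFFERS;   -- empty the buffer pool (cold data cache)
DBCC FREEPROCCACHE;      -- clear cached plans (cold plan cache)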

Performance Testing a Greenfield Database

Assuming that best practices have been followed when designing a new database, how does one go about testing the database in a way that can improve confidence in the database's ability to meet adequate performance standards, and that will suggest performance-enhancing tweaks to the database structure if they are needed?
Do I need test data? What does that look like if no usage patterns have been established for the database yet?
NOTE: Resources such as blog posts and book titles are welcome.
I would do a few things:
1) Simulate user/application connections to the DB and test load (load testing).
I would suggest connecting with many more users than are expected to actually use the system. You can have all your users log in or pick up third party software that will log in many many users and perform defined functions that you feel is an adequate test of your system.
2) Insert many (possibly millions of) test records and load test again (scalability testing). As tables grow you may find you need indexes where you didn't have them before. Or there could be problems with views or joins used throughout the system.
3) Analyze the database. I am referring to the method of analyzing tables. Here is a boring page describing what it is. Also, here is a link to a great article on Oracle database tuning. Some of it might relate to what you are doing.
4) Run queries generated by applications/users and run explain plans for them. This will, for example, tell you when you have full table scans. It will help you fix a lot of your issues (a small example follows this list).
5) Also backup and reload from these backups to show confidence in this as well.
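For point 4, on SQL Server one way to get a plan without actually running the query is SHOWPLAN; a hedged sketch (the query shown is just a placeholder):

SET SHOWPLAN_XML ON;
GO
SELECT CustomerID, COUNT(*) AS Orders
FROM   dbo.Orders
GROUP BY CustomerID;      -- returns the plan, not the rows
GO
SET SHOWPLAN_XML OFF;
GO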
You could use a tool such as RedGate's Data Generator to get a good load of test data in it to see how the schema performs under load. You're right that without knowing the usage patterns it's difficult to put together a perfect test plan but I presume you must have a rough idea as to the kind of queries that will be run against it.
Adequate performance standards are really defined by the specific client applications that will consume your database. Get a sql profiler trace going whilst the applications hit your db and you should be able to quickly spot any problem areas which may need more optimising (or even de-normalising in some cases).
+1 birdlips, agree with the suggestions. However, database load testing can be very tricky precisely because the first and the crucial step is about predicting as best as possible the data patterns that will be encountered in the real world. This task is best done in conjunction with at least one domain expert, as it's very much to do with functional, not technical aspects of the system.
Modeling data patterns is ever so critical as most SQL execution plans are based on table "statistics", i.e. counts and ratios, which are used by modern RDBMS to calculate the optimal query execution plan. Some people have written books on the so called "query optimizers", e.g. Cost Based Oracle Fundamentals and it's quite often a challenge troubleshooting some of these issues due to a lack of documentation of how the internals work (often intentional as RDBMS vendors don't want to reveal too much about the details).
Back to your question, I suggest the following steps:
Give yourself a couple of days/weeks/months (depending on the size and complexity of the project) to try to define the state of a 'mature' (e.g. 2-3 year old) database, as well as some performance test cases that you would need to execute on this large dataset.
Build all the scripts to pump in the baseline data. You can use 3rd-party tools, but I often found them lacking in functionality for the more advanced data distributions, and it's also often much faster to write SQL than to learn new tools.
Build/implement the performance test scenario client! This now heavily depends on what kind of an application the DB needs to support. If you have a browser-based UI there are many tools such as LoadRunner, JMeter to do end-to-end testing. For web services there's SoapSonar, SoapUI... Maybe you'll have to write a custom JDBC/ODBC/.Net client with multi-threading capabilities...
Test -> tune -> test -> tune...
When you place the system in production get ready for surprises as your prediction of data patterns will never be very accurate. This means that you (or whoever is the production DBA) may need to think on his/her feet and create some indexes on the fly or apply other tricks of the trade.
Good luck
I'm in the same situation now, here's my approach (using SQL Server 2008):
Create a separate "Numbers" table with millions of rows of sample data. The table may have random strings, GUIDs, numerical values, etc.
Write a procedure to insert the sample data into your schema. Use modulus (%) of a number column to simulate different UserIDs, etc. (a small sketch of this follows these steps).
Create another "NewData" table similar to the first table. This can be used to simulate new records being added.
Create a "TestLog" table where you can record rowcount, start time and end time for your test queries.
Write a stored procedure to simulate the workflow you expect your application to perform, using new or existing records as appropriate.
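A rough sketch of steps 1-2 above (the Numbers table is assumed to contain sequential integers, and the target table and columns are invented):

-- Use modulus on the Number column to spread rows across simulated
-- users, products and dates.
INSERT INTO dbo.Orders (UserID, ProductID, OrderDate)
SELECT  n.Number % 10000 + 1,                      -- ~10,000 distinct UserIDs
        n.Number % 500 + 1,                        -- ~500 distinct ProductIDs
        DATEADD(DAY, -(n.Number % 730), GETDATE()) -- spread over two years
FROM    dbo.Numbers AS n
WHERE   n.Number <= 1000000;                       -- one million sample rows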
If performance seems fast, consider the probability of a cache miss! For example, if your production server has 32GB RAM, and your table is expected to be 128GB, a random row lookup is >75% likely to not be found in the buffer cache.
To simulate this, you can clear the cache before running your query:
DBCC DROPCLEANBUFFERS;
(If Oracle: ALTER SYSTEM FLUSH BUFFER_CACHE;)
You may notice a 100x slowdown in performance as indexes and data pages must now be loaded from disk.
Run SET STATISTICS IO ON; to gather query statistics. Look for cases where the number of logical reads is very high (> 1000) for a query. This is usually a sign of a full table scan.
Use the standard techniques to understand your query access patterns (scans vs. seek) and tune performance.
Also include the Actual Execution Plan and use SQL Server Profiler.
