Oracle recommendations for high volume writes and low volume reads

Are there any general guidelines online on how to tune Oracle for a high number of inserts and a low number of reads?
All the answers below are pretty good recommendations, but I should clarify a few things: I am using 10g, and using Oracle is an absolute requirement. I am also most interested in Oracle instance parameters for tuning (perhaps some different locking policies).

Let me assume you want to do an extremely high number of inserts, to the point where you are willing to sacrifice all other kinds of operations just to get those inserts to complete without problems.
First, have you completely ruled out other types of databases? There are systems, such as industrial/process databases, that cope very well with massive volumes of inserts and are typically used to receive and store data from equipment measuring something in a factory environment. Oracle is a general-purpose relational database; it might not be the right type of software for your needs.
Having said that, let's assume you can, or will, or should, use Oracle. The very first thing you need to do is to consider the various types of data you will be storing. If it's all the same kind of data, you need one table, and it needs to be lean and mean regarding inserts.
The optimal way to do that is the following:
Do not add any indexes on this table at all; if you need a primary key, that should be the only index.
If you need to do reads against this table, consider keeping a shadow table with indexes that you run reads, lookups, and aggregates against. If it doesn't have to be up to the millisecond, a periodic batch job can refresh it from the master table (see the sketch after this list). This disturbs the master table as little as possible.
Make sure your server has fast disks. Transactional write operations will typically involve the disk at some point, so make sure that's as small a bottleneck as you can get.
If your application is gathering data from many incoming sources, consider adding a layer in front of the database that keeps the number of concurrent connections, and thus concurrent transactions against that table, to a minimum. If you get a high number of concurrent writes contending for the same blocks in an Oracle database, your performance will ultimately suffer.
If you can split up the data, consider splitting it in such a way that it is stored on different physical disks. That way, disk I/O problems won't cross data types and will only affect one type of data.
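A minimal sketch of the shadow-table refresh from the second bullet above, assuming a lean MEASUREMENTS master table with a RECORDED_AT timestamp column and an indexed MEASUREMENTS_RPT shadow table (all names are hypothetical). It could be scheduled with DBMS_SCHEDULER (or DBMS_JOB on 10g) to run every few minutes:

-- copy only rows newer than the shadow table's high-water mark
INSERT INTO measurements_rpt
SELECT m.*
FROM   measurements m
WHERE  m.recorded_at >
       (SELECT NVL(MAX(recorded_at), DATE '1900-01-01') FROM measurements_rpt);
COMMIT;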
At the other end of the spectrum you have a denormalized table with lots of indexes optimized for a balance between lookups and updates, and you need to find some middle way that gets you the performance you want.

In terms of database design, put as few constraints, indexes and triggers on the table(s) you're inserting into as possible, as these will all slow down the insert.
The lack of indexes will obviously hurt your SELECT performance, but it doesn't sound like this is your primary concern.
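If you do need secondary indexes on a table that receives large batch loads, one hedged option (the index name here is hypothetical) is to mark non-unique indexes UNUSABLE for the duration of the load and rebuild them afterwards; in 10g, SKIP_UNUSABLE_INDEXES defaults to TRUE, so conventional inserts keep working while the index is unusable:

ALTER INDEX measurements_ix1 UNUSABLE;

-- ... run the bulk inserts here ...

ALTER INDEX measurements_ix1 REBUILD NOLOGGING;

This only pays off for batch-style loads; for a steady trickle of single-row inserts the rebuild cost usually outweighs the gain.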

What sort of application are we talking about? What version of Oracle?
If you are designing a data warehouse load process, for example, you would generally want to do direct-path inserts into staging table(s), then build any necessary indexes, then do a partition exchange to load the data into the partitioned destination table. This doesn't work as well, of course, if you are doing single-row inserts.
Depending on the Oracle version and the type of application, you may also want to enable compression on the table. Inserts are generally cheap from a CPU standpoint, so there is probably plenty of CPU available to do the compression, which can substantially decrease the amount of I/O required; I/O is generally going to be your bottleneck.
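A hedged sketch of that load pattern, with hypothetical object and partition names. Basic table compression in 10g only takes effect for direct-path (APPEND) loads, and the exchange assumes the destination table has a matching local index:

-- compressed, nologging staging table shaped like the target
CREATE TABLE sales_stage COMPRESS NOLOGGING
AS SELECT * FROM sales WHERE 1 = 0;

-- direct-path load from the source (e.g. an external table or feed query)
INSERT /*+ APPEND */ INTO sales_stage
SELECT * FROM sales_feed_ext;
COMMIT;

-- build the indexes the destination partition expects
CREATE INDEX sales_stage_ix1 ON sales_stage (sale_date) NOLOGGING;

-- swap the loaded segment into the partitioned table
ALTER TABLE sales EXCHANGE PARTITION p_2009_01
WITH TABLE sales_stage INCLUDING INDEXES WITHOUT VALIDATION;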

I'm going to suggest that you take your question to Tom Kyte's site, http://asktom.oracle.com. You can generally find an answer there. Otherwise, try Oracle's forums.
Also try looking up any of Tom Kyte's books. I suggest checking a library or your local bookstore to find the right one and make sure it covers the topics you need. His blog also has links to his books and some articles/discussions on each one.
I did a quick Google search (site:oracle.com tuning write) and found OracleAS TopLink Writing Optimization Features. I realize that you might not be using TopLink, but it may have some good tips. Keywords you'll want to try: tuning, performance, insert(s), improve. Also throw in the technology you are using, like Java/C++/etc.
Other tips you can try:
Using stored procedures, or using them in more efficient ways (a bulk-binding sketch follows this list).
Tweaking your server's hardware: faster hard drives or a specific RAID array, possibly more CPUs.
Ask Tom thread - some nice comments here, also links to Fowler's site
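On the stored-procedure point above, one technique worth knowing is PL/SQL bulk binding, which cuts down on per-row context switches when an application funnels many inserts through a procedure. A minimal sketch, assuming a hypothetical MEASUREMENTS table:

DECLARE
  TYPE t_rows IS TABLE OF measurements%ROWTYPE;
  l_rows t_rows := t_rows();
BEGIN
  -- ... fill l_rows from the incoming feed, e.g. BULK COLLECT from a staging query ...
  IF l_rows.COUNT > 0 THEN
    FORALL i IN 1 .. l_rows.COUNT          -- one round trip to the SQL engine
      INSERT INTO measurements VALUES l_rows(i);
  END IF;
  COMMIT;
END;
/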
You will probably have to start running some performance analytics on your queries/implementations to find the sweet spot for each one. I wish I had an easy answer for you. Good Luck!

A couple of suggestions for you to look into further:
direct path load
block compression

Related

Non-clustered index for 100 million records in a table, or partitions

I have a table in IBM DB2 which contains more than 100 million records. The database was created 13 years ago and is not partitioned. Searching data and creating joins with this table takes a huge amount of time. What would be the proper approach to optimize searching and joins?
1. Using a non-clustered index and searching via indexes.
2. Partitioning the table.
3. Or any other efficient approach.
Thanks in advance for your valuable time and effort.
A "proper" approach is, of course, subjective. It's usually a trade-off, and the things most people trade off are the cost of implementing the change, the cost of maintaining the change, and the performance of the solution.
In all cases, I recommend gathering metrics, and agreeing your target - otherwise, you risk continuously optimizing beyond the point the business really needs. Typically, this means creating a representative test environment, with representative data. You then run the queries as they are today, and measure their performance. Finally, you agree (with whoever is paying the bills) what the minimum and optimum targets are. Once you reach that target - stop!
By far the cheapest solution is to optimize your queries, which often means creating indices. Depending on your queries, this can sometimes take just a few hours, and doesn't require any ongoing maintenance.
The next thing to do is to look at server configuration - tuning the memory allocation and disk strategy can do wonders, as can making sure the database statistics are up to date. These tasks usually require 2 or 3 people to work together, and you may need to set up regular maintenance tasks.
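As a hedged illustration of those two cheapest steps - index the columns your searches and joins actually use, then refresh the statistics so the optimizer notices - with hypothetical schema, table, and column names (the RUNSTATS command is run from the DB2 command line):

CREATE INDEX big_tab_cust_ix ON myschema.big_tab (customer_id, order_date);

RUNSTATS ON TABLE myschema.big_tab
  WITH DISTRIBUTION AND DETAILED INDEXES ALL;

You can then re-check the access plan (for example with db2expln) to confirm the joins actually pick up the new index.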
If that doesn't do the job, consider improving the hardware. If your database server is as old as the database (13 years), it's quite possible that your mobile phone has better performance characteristics than your server. It's much cheaper to improve the hardware than it is to go to the next steps.
If hardware doesn't solve the problem, consider de-normalizing your data. For instance, if you are running lots of queries joining your large table to other large tables, consider creating a de-normalized table with all the data you need to fulfill that query. This is expensive, both from a development point of view (you have to work out how to maintain the denormalized data, how to make sure all the queries still work), and from a maintenance point of view - the additional complexity will make all enhancements and bug fixes harder.
If denormalizing doesn't work, partitioning is the next most expensive solution. This is a fairly drastic solution, because as far as I know, there's no "out of the box" solution to glue your front-end applications into the partitioning logic. So, pretty much every piece of code that needs to interact with the database needs to understand the partitioning logic, and a bug in any one place will break every other component that interacts with that data.

Which DBMS is suitable for my needs?

I'm working on a project aimed at analyzing biometric data collected from various terminals. The processing is not very compute-heavy; rather, the workload is I/O bound. The amount of data is huge (hundreds of millions of records per table). Unfortunately, the database is relational, and there are 20 foreign keys. Changing the values of referenced keys is very common during the completion of a job, so there will be lots of UPDATEs and SET NULLs while collecting data.
Currently, the semantics of the database have been designed. All the programs are almost complete, and a MySQL prototype of the database has been created. It works fine with sample (small-scale) data.
I searched for a suitable DBMS for the project, but googling around for "DBMS comparisons" etc. didn't help. People say contradictory things: some say MySQL will perform inserts and updates faster, some say Oracle9 is better...
I can't find any reliable, benchmark-based comparison between DBMSs. I use MySQL in everyday projects, but this one looks more critical.
What we need:
The license and cost of the DBMS are not important, but of course open source (GPL or LGPL) is preferred (since the whole project will be published under the LGPL).
Very fast inserts, very fast updates, and support for a lot of foreign keys are needed.
The DBMS should handle 0-100 connections at a time.
Terminals are connected to the server over a local network (LAN).
What I'm actually looking for is a benchmark of various DBMSs. It might contain charts and separate comparisons of different operations (insert, update, delete) in various situations (on a table with referenced fields, or a plain table)...
For this sort of workload, I would recommend PostgreSQL, Informix, or Oracle. PostgreSQL is open source (under a BSD-style license that everyone agrees is GPL compatible). The reasons have to do with some aspects of data modelling that may be extremely helpful in your case. In general you have two important questions:
1) How far can I tune my db for what I am doing? How far can I scale it?
and
2) How can I model my data?
On the first, Oracle and PostgreSQL are more complex but more flexible. That flexibility may come in handy. On the second, the flexibility may save you a lot of effort later. Moreover it opens up new doors regarding optimization which are not possible in a straight relational model. First I would recommend looking at this: http://db.cs.berkeley.edu/papers/Informix/www.informix.com/informix/corpinfo/zines/whitpprs/illuswp/wave.htm as it will give you some background as to what I am thinking. Additionally, if you look at what Stonebraker is talking about you will see that straight benchmarks are really an apples to oranges comparison here.
The idea of going with an ORDBMS means a few important things:
You can model data that is functionally dependent on your data. For example, you can have a function in Java or Python which manipulates your data and returns a result, and you can index the output of that function, trading insert performance for select performance where you need to (see the expression-index sketch after this list).
Less data being stored means faster inserts.
An ability to extend your data with custom types and functions, providing higher performance access to your data.
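A hedged sketch of the "index the output of a function" point, using a plain SQL function for brevity instead of Java or Python (all names are made up):

-- a hypothetical readings table
CREATE TABLE readings (
    id        bigserial PRIMARY KEY,
    raw_value double precision NOT NULL,
    taken_at  timestamptz NOT NULL
);

-- a derived value computed by a function
CREATE FUNCTION normalized_value(double precision) RETURNS double precision
    AS 'SELECT ($1 - 100.0) / 15.0' LANGUAGE sql IMMUTABLE;

-- index the function's output: lookups on the derived value become index scans,
-- paid for by a little extra work on every insert
CREATE INDEX readings_norm_idx ON readings (normalized_value(raw_value));

A query such as SELECT * FROM readings WHERE normalized_value(raw_value) > 2.5 can then use the index instead of scanning the table.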
PostgreSQL 9.2 will support up to approximately 14,000 writes per second on sufficient hardware, which is nothing to sneeze at. Of course this depends on the width of the write, hardware performance on the server, etc. PostgreSQL is used by Afilias to manage the .org and .info top-level domains (web-scale!) and also by Skype's infrastructure (still, even after Microsoft bought them).
Finally, as part of your information pipeline, if you are processing huge amounts of data and need to do some preprocessing before sending it to PostgreSQL, you might look at array-native databases (for a NoSQL approach common in scientific work) or VoltDB (an in-memory store for high-throughput processing). Despite being extremely different systems, VoltDB and Postgres were actually started by the same individual.
Finally, regarding benchmark charts: the major database vendors more or less ban publication of benchmarks in their license agreements, so you won't find them.

Why don't databases intelligently create the indexes they need?

I just heard that you should create an index on any column you're joining or querying on. If the criterion is this simple, why can't databases automatically create the indexes they need?
Well, they do; to some extent at least...
See SQL Server Database Engine Tuning Advisor, for instance.
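To make "they do, to some extent" concrete: SQL Server also records missing-index suggestions in dynamic management views, which you can query yourself and treat as hints rather than gospel. A sketch:

SELECT mid.statement        AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks,
       migs.avg_user_impact
FROM   sys.dm_db_missing_index_details     AS mid
JOIN   sys.dm_db_missing_index_groups      AS mig
       ON mig.index_handle = mid.index_handle
JOIN   sys.dm_db_missing_index_group_stats AS migs
       ON migs.group_handle = mig.index_group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;   -- rough "most useful first" ordering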
However, creating optimal indexes is not as simple as you suggest. An even simpler rule would be to create indexes on every column, which is far from optimal!
Indexes are not free. You create indexes at the cost of storage and update performance among other things. They should be carefully thought about to be optimal.
Every index you add may increase the speed of your queries. It will decrease the speed of your updates, inserts and deletes and it will increase disk space usage.
I, for one, would rather keep the control to myself, using tools such as DB Visualizer and explain statements to provide the information I need to evaluate what should be done. I do not want a DBMS unilaterally deciding what's best.
It's far better, in my opinion, that a truly intelligent entity be making decisions re database tuning. The DBMS can suggest all it wants but the final decision should be left up to the DBAs.
What happens when the database usage patterns change for one week? Do you really want the DBMS creating indexes and destroying them a week later? That sounds like a management nightmare scenario right up alongside Skynet :-)
This is a good question. Databases could create the indexes they need based on data usage patterns, but this means that the database would be slow the first time certain queries were executed and then get faster as time goes on. For example if there is a table like this:
ID USERNAME
-- --------
then the username would be used to look up users very often. After some time the database could see that, say, 50% of queries did this, in which case it could add an index on the username column.
However, the reason this hasn't been implemented in great detail is simply that it is not a killer feature. Adding indexes is something a DBA does relatively rarely, and automating it (which is a very big task) is probably just not worth it for the database vendors. Remember that every query would have to be analyzed to enable auto-indexing, along with its response time and result set size, so it is non-trivial to implement.
Because databases simply store and retrieve data - the database engine has no clue how you intend to retrieve that data until you actually do it, at which point it is too late to create an index. And the column you are joining on may not be suitable for an efficient index.
It's a non-trivial problem to solve, and in many cases a sub-optimal automatic solution might actually make things worse. Imagine a database whose read operations were sped up by automatic index creation but whose inserts and updates got hosed as a result of the overhead of managing the index? Whether that's good or bad depends on the nature of your database and the application it's serving.
If there were a one-size-fits-all solution, databases would certainly do this already (and there are tools to suggest exactly this sort of optimization). But tuning database performance is largely an app-specific function and is best accomplished manually, at least for now.
An RDBMS could easily self-tune and create indices as it saw fit but this would only work for simple cases with queries that do not have demanding execution plans. Most indices are created to optimize for specific purposes and these kinds of optimizations are better handled manually.

Performance Testing a Greenfield Database

Assuming that best practices have been followed when designing a new database, how does one go about testing the database in a way that can improve confidence in the database's ability to meet adequate performance standards, and that will suggest performance-enhancing tweaks to the database structure if they are needed?
Do I need test data? What does that look like if no usage patterns have been established for the database yet?
NOTE: Resources such as blog posts and book titles are welcome.
I would do a few things:
1) Simulate user/application connections to the db and test the load (load testing).
I would suggest connecting with many more users than are expected to actually use the system. You can have all your users log in, or pick up third-party software that will log in many, many users and perform defined functions that you feel are an adequate test of your system.
2) Insert many (possibly millions of) test records and load test again (scalability testing). As tables grow, you may find you need indexes where you didn't have them before. Or there could be problems with views or joins used throughout the system.
3) Analyze the database. I am referring to the method of analyzing tables. Here is a boring page describing what it is. Also here is a link to a great article on Oracle database tuning. Some of it might relate to what you are doing.
4) Run queries generated by applications/users and run explain plans for them. This will, for example, tell you when you have full table scans. It will help you fix a lot of your issues (an example follows this list).
5) Also back up and reload from those backups, to build confidence in that as well.
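For step 4, a hedged Oracle-flavoured example (the table, columns, and bind names are hypothetical):

EXPLAIN PLAN FOR
  SELECT *
  FROM   orders
  WHERE  customer_id = :cust_id
  AND    order_date  > SYSDATE - 7;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- watch for TABLE ACCESS FULL steps on large tables in the output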
You could use a tool such as RedGate's Data Generator to get a good load of test data in it to see how the schema performs under load. You're right that without knowing the usage patterns it's difficult to put together a perfect test plan but I presume you must have a rough idea as to the kind of queries that will be run against it.
Adequate performance standards are really defined by the specific client applications that will consume your database. Get a SQL Profiler trace going whilst the applications hit your db and you should be able to quickly spot any problem areas which may need more optimising (or even de-normalising in some cases).
+1 birdlips, agree with the suggestions. However, database load testing can be very tricky precisely because the first and crucial step is about predicting, as best as possible, the data patterns that will be encountered in the real world. This task is best done in conjunction with at least one domain expert, as it's very much to do with functional, not technical, aspects of the system.
Modeling data patterns is ever so critical, as most SQL execution plans are based on table "statistics", i.e. counts and ratios, which modern RDBMSs use to calculate the optimal query execution plan. Some people have written books on the so-called "query optimizers", e.g. Cost-Based Oracle Fundamentals, and it is quite often a challenge troubleshooting some of these issues due to a lack of documentation of how the internals work (often intentional, as RDBMS vendors don't want to reveal too much about the details).
Back to your question, I suggest the following steps:
Give yourself a couple of days/weeks/months (depending on the size and complexity of the project) to try to define the state of a 'mature' (e.g. 2-3 year old) database, as well as some performance test cases that you would need to execute on this large dataset.
Build all the scripts to pump in the baseline data. You can use 3rd-party tools, but I often found them lacking the functionality for more advanced data distributions, and it's often much faster to write SQL than to learn new tools.
Build/implement the performance test scenario client! This depends heavily on what kind of application the DB needs to support. If you have a browser-based UI, there are many tools such as LoadRunner or JMeter for end-to-end testing. For web services there's SoapSonar, SoapUI... Maybe you'll have to write a custom JDBC/ODBC/.Net client with multi-threading capabilities...
Test -> tune -> test -> tune...
When you place the system in production get ready for surprises as your prediction of data patterns will never be very accurate. This means that you (or whoever is the production DBA) may need to think on his/her feet and create some indexes on the fly or apply other tricks of the trade.
Good luck
I'm in the same situation now, here's my approach (using SQL Server 2008):
Create a separate "Numbers" table with millions of rows of sample data. The table may have random strings, GUIDs, numerical values, etc.
Write a procedure to insert the sample data into your schema. Use the modulus (%) of a number column to simulate different UserIDs, etc. (a sketch appears at the end of this answer).
Create another "NewData" table similar to the first table. This can be used to simulate new records being added.
Create a "TestLog" table where you can record rowcount, start time and end time for your test queries.
Write a stored procedure to simulate the workflow you expect your application to perform, using new or existing records as appropriate.
If performance seems fast, consider the probability of a cache miss! For example, if your production server has 32GB RAM, and your table is expected to be 128GB, a random row lookup is >75% likely to not be found in the buffer cache.
To simulate this, you can clear the cache before running your query:
DBCC DROPCLEANBUFFERS;
(The closest Oracle equivalent is ALTER SYSTEM FLUSH BUFFER_CACHE;)
You may notice a 100x slowdown in performance as indexes and data pages must now be loaded from disk.
Run SET STATISTICS IO ON; to gather query statistics. Look for cases where the number of logical reads is very high (> 1000) for a query. This is usually a sign of a full table scan.
Use the standard techniques to understand your query access patterns (scans vs. seek) and tune performance.
Also look at the Actual Execution Plan and use SQL Server Profiler.
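A minimal sketch of steps 1, 2 and 4 above in SQL Server 2008 syntax; dbo.Orders and dbo.TestLog are hypothetical tables standing in for your own schema:

-- build a million-row Numbers table from a cross join of system views
SELECT TOP (1000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
INTO   dbo.Numbers
FROM   sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

-- use modulus to spread sample rows across ~5,000 simulated users
INSERT INTO dbo.Orders (UserID, OrderDate, Amount)
SELECT n % 5000,
       DATEADD(SECOND, n, '20080101'),
       (n % 100) + 0.99
FROM   dbo.Numbers;

-- time one test query and record it in TestLog
DECLARE @start datetime2 = SYSDATETIME(), @rows int;
SELECT @rows = COUNT(*) FROM dbo.Orders WHERE UserID = 42;
INSERT INTO dbo.TestLog (TestName, RowCnt, StartTime, EndTime)
VALUES ('orders by user', @rows, @start, SYSDATETIME());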

Ensuring database performance as data volume increases

I am currently working on a performance tuning exercise. The application is DB-intensive with very little processing logic. The tuning effort is focused on the way DB calls are made and on the DB itself.
We did the query tuning, added the missing indexes, and reduced or eliminated DB calls where possible. The application is performing very well and all is fine.
With a smaller data volume (say, up to 100,000 records), the performance is fantastic. My question is: what needs to be done to ensure such good performance at higher data volumes?
The data volumes are expected to reach 10 million records.
I can think of table and index partitioning, suggesting filesystems optimized for DB storage and periodic archiving to keep the number of rows in check. I would like to know what else could be done. Any tips/strategies/patterns would be very helpful.
Monitoring. Use tools to monitor performance and the saturation of CPU, memory, and I/O. Plot trend lines so you know where your next bottleneck will be before you get there.
Testing. Create mock data so you have 10 million rows on a testing server today. Benchmark the queries you have in your application and see how well they perform as the volume of data increases. You might be surprised at what breaks down first, or it may go exactly as predicted. The point is that you can find out.
Maintenance. Make sure your application and infrastructure support some downtime, because that's always necessary. You might have to defrag and rebuild your indexes. You might have to refactor some of the table structure. You might have to upgrade the server software or apply patches. To do this without interrupting continuous operation, you'll need some redundancy built in to the design.
Research. Find the best journals and blogs for the database brand you're using, and read them (e.g. http://www.mysqlperformanceblog.com if you use MySQL). You can ask good questions like the one you ask here, but also read what other people are asking, and what they're being advised to do about it. You can learn solutions to problems that you don't even have yet, so that when you have them, you'll have some strategies to employ.
Different databases need to be tuned in different ways. What RDBMS are you using?
Also, how do you know whether or not what you have done so far will result in poor performance with larger data sets? Have you tested your current optimisations with a large amount of test data?
When you did this, how did the performance change? If you are able to tune the database so that it performs with the data it has now, there's no reason to think that your methods won't work with a larger data set.
Depending on the RDBMS, the next type of solution is simple: get bigger, beefier hardware. More RAM, more disks, more CPUs.
You are on the right track:
1) Proper indexes
2) DBMS options tuning (memory caches, buffers, internal threads control and so on)
3) Query tuning (especially log slow queries and then tune/rewrite them - see the sketch after this list)
4) To tune your queries and indexes you may need to research your queries execution plans
5) A powerful dedicated server
6) Think about queries which your client applications send to the database. Are they always necessary? Do you need all the data you ask for? Is it possible to cache some data?
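For point 3, assuming MySQL (which comes up elsewhere in this thread), a hedged example of enabling the slow query log and then examining one of the offenders; the file path and the query are placeholders:

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                              -- log anything over 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

-- then explain the statements that show up in the log
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;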
10 million records is probably too small to bother with partitioning. Typically partitioning will only be interesting if your data volumes are an order of magnitude or so bigger than that.
Index tuning for a database with 100,000 rows will probably get you 99% of what you need with 10 million rows. Keep an eye out for table scans or index range scans on the large tables in the system. On smaller tables they are fine and in some cases even optimal.
Archiving old data may help but this is probably overkill for 10 million rows.
One possible optimisation is to move reporting off onto a separate server. This reduces the burden on the main server - reports are often quite antisocial when run on operational systems, as the schema tends not to be well optimised for them.
You can use database replication to do this, or build a data mart for reporting. Replication is easier to implement, but the reports will be no more efficient than they were on the production system. Building a star-schema data mart will be more efficient for reporting, but incurs additional development work.
