I have a table in IBM DB2 that contains more than 100 million records. The database was created 13 years ago and is not partitioned. Searching data and creating joins with this table takes a huge amount of time. What would be the proper approach to optimize searching and joins?
1. Using a non-clustered index and searching via indexes.
2. Partitioning the table.
3. Or any other efficient approach.
Thanks in advance for your valuable time and effort.
A "proper" approach is, of course, subjective. It's usually a trade-off, and the things most people trade off are the cost of implementing the change, the cost of maintaining the change, and the performance of the solution.
In all cases, I recommend gathering metrics, and agreeing your target - otherwise, you risk continuously optimizing beyond the point the business really needs. Typically, this means creating a representative test environment, with representative data. You then run the queries as they are today, and measure their performance. Finally, you agree (with whoever is paying the bills) what the minimum and optimum targets are. Once you reach that target - stop!
By far the cheapest solution is to optimize your queries, which often means creating indices. Depending on your queries, this can sometimes take just a few hours, and doesn't require any ongoing maintenance.
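As a rough illustration (the schema, table, and column names here are hypothetical), the usual first step on DB2 is a non-clustered index on the columns your WHERE clauses and join predicates actually use:

    -- Hypothetical example: index the columns used for filtering and joining.
    CREATE INDEX big_schema.ix_orders_cust_date
        ON big_schema.orders (customer_id, order_date);

    -- The kind of query this index is meant to serve.
    SELECT order_id, order_date
    FROM   big_schema.orders
    WHERE  customer_id = 12345
    ORDER BY order_date DESC;

Checking the access plan (for example with db2exfmt or the Explain facility) before and after tells you whether the optimizer actually picks the index up.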
The next thing to do is to look at server configuration - tuning the memory allocation and disk strategy can do wonders, and making sure the database statistics are up to date. These tasks usually require 2 or 3 people to work together, and you may need to set up regular maintenance tasks.
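On DB2, keeping statistics current usually comes down to scheduling RUNSTATS; a minimal sketch from the command line (table name hypothetical), which can also be issued through SYSPROC.ADMIN_CMD:

    -- Refresh optimizer statistics, including distribution stats and all indexes.
    RUNSTATS ON TABLE big_schema.orders
        WITH DISTRIBUTION AND DETAILED INDEXES ALL;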
If that doesn't do the job, consider improving the hardware. If your database server is as old as the database (13 years), it's quite possible that your mobile phone has better performance characteristics than your server. It's much cheaper to improve the hardware than it is to go to the next steps.
If hardware doesn't solve the problem, consider de-normalizing your data. For instance, if you are running lots of queries joining your large table to other large tables, consider creating a de-normalized table with all the data you need to fulfill that query. This is expensive, both from a development point of view (you have to work out how to maintain the denormalized data, how to make sure all the queries still work), and from a maintenance point of view - the additional complexity will make all enhancements and bug fixes harder.
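In DB2 specifically, one relatively low-maintenance way to get this effect is a materialized query table (MQT), which stores the result of the expensive join and is refreshed on a schedule. A hedged sketch with hypothetical tables:

    -- Materialize the expensive join once, then refresh it periodically.
    CREATE TABLE big_schema.order_customer_mqt AS
        (SELECT o.order_id, o.order_date, c.customer_name, c.region
         FROM   big_schema.orders o
         JOIN   big_schema.customers c ON c.customer_id = o.customer_id)
        DATA INITIALLY DEFERRED REFRESH DEFERRED;

    -- Populate / refresh the materialized data (e.g. from a nightly job).
    REFRESH TABLE big_schema.order_customer_mqt;

The maintenance and staleness trade-offs described above still apply; an MQT just spares you from hand-rolling the plumbing.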
If denormalizing doesn't work, partitioning is the next most expensive solution. This is a fairly drastic solution, because as far as I know, there's no "out of the box" solution to glue your front-end applications into the partitioning logic. So, pretty much every piece of code that needs to interact with the database needs to understand the partitioning logic, and a bug in any one place will break every other component that interacts with that data.
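For completeness, if you go down the native DB2 table-partitioning route (as opposed to splitting data across tables or databases yourself), the declaration looks roughly like this - names and ranges are hypothetical, and the exact syntax varies by DB2 version:

    -- Range-partition by date so scans and maintenance touch only the relevant partitions.
    CREATE TABLE big_schema.orders_part (
        order_id    BIGINT  NOT NULL,
        customer_id INTEGER NOT NULL,
        order_date  DATE    NOT NULL
    )
    PARTITION BY RANGE (order_date)
        (STARTING '2012-01-01' ENDING '2025-12-31' EVERY 3 MONTHS);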
Azure's DocumentDB has a write-optimized JSON datastore with automatic indexing of records. Are there good resources to read about how this is achieved? Is this well documented in the academic database literature?
DocumentDB describes the indexing policy as:
Automatic indexing of documents is enabled by write optimized, lock free, and log structured index maintenance techniques. DocumentDB supports a sustained volume of fast writes while still serving consistent queries.
http://azure.microsoft.com/en-us/documentation/articles/documentdb-indexing-policies/
It is also claimed that this index typically requires 2-20% of the size of the main table:
Based on production usage in consumer-scale first-party applications using DocumentDB, the typical index overhead is between 2-20%. The indexing technology used by DocumentDB ensures that, regardless of the values of the properties, the index overhead does not exceed 80% of the size of the documents with default settings.
http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/#IndexOverhead
Are there any papers that describe how to implement this sort of indexing scheme?
There is no paper, yet.
A paper describing the internals of our indexing has been drafted and is undergoing final reviews.
We expect to publish this as soon as it is final.
What are the top issues to look into, and in what order of importance, when optimizing (performance tuning, troubleshooting) an existing (but unknown to you) database?
Which actions/measures in your previous optimizations gave the most effect (possibly with the minimum of work)?
I'd like to partition this question into the following categories (in order of interest to me):
ones that show the performance boost (improvement) in the shortest time, i.e. the most cost-effective methods/actions;
non-intrusive or least-troublesome but still effective methods (without changing existing schemas, etc.);
intrusive methods
Update:
Suppose I have a copy of a database on a dev machine, without access to the production environment to observe stats, most-used queries, performance counters, etc. in real use.
This is a development-related, not a DBA-related, question.
Update2:
Suppose the database was developed by others and was given to me for optimization (review) before it was delivered to production.
It is quite usual to have outsourced development detached from end-users.
Besides, there is a database design paradigm that says a database, in contrast to application data storage, should be a value in itself, independently of the specific applications that use it or of the context of its use.
Update3: Thanks to all answerers! You all pushed me to open a subquestion:
How do you stress-load a dev database (server) locally?
Create a performance Baseline (non-intrusive, use performance counters)
Identify the most expensive queries (non-intrusive, use SQL Profiler; see the query sketch below)
Identify the most frequently run queries (non-intrusive, use SQL Profiler)
Identify any overly complex queries, or those using slowly performing constructs or patterns. (non-intrusive to identify, use SQL Profiler and/or code inspections; possibly intrusive if changed, may require substantial re-testing)
Assess your hardware
Identify Indexes that would benefit the measured workload (non-intrusive, use SQL Profiler)
Measure and compare to your baseline.
If you have very large databases, or extreme operating conditions (such as 24/7 or ultra high query loads), look at the high end features offered by your RDBMS, such as table/index partitioning.
This may be of interest: How Can I Log and Find the Most Expensive Queries?
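If the server in question is SQL Server 2005 or later, the plan-cache DMVs give a rough, non-intrusive version of the "most expensive" and "most frequently run" steps without running a trace; a sketch:

    -- Top 20 statements by total CPU time, straight from the plan cache.
    SELECT TOP (20)
           qs.total_worker_time / 1000 AS total_cpu_ms,
           qs.execution_count,
           qs.total_logical_reads,
           SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                     ((CASE qs.statement_end_offset
                           WHEN -1 THEN DATALENGTH(st.text)
                           ELSE qs.statement_end_offset
                       END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM   sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_worker_time DESC;

Re-order by execution_count to get the "most frequently run" view of the same data.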
If the database is unknown to you and you're under pressure, then you may not have time for Mitch's checklist, which is good best practice for monitoring server health.
You also need access to production to gather real info from assorted queries you can run. Without this, you're doomed. The server load pattern is important: you can't reproduce many issues yourself on a development server, because you won't use the system like an end user.
Also, focus on "biggest bang for the buck". An expensive query running once daily at 3am can be ignored. A not-so-expensive one running every second is well worth optimising. However, you may not know this without knowing server load pattern.
So, basic steps:
Assuming you're firefighting:
server logs
SQL Server logs
sys.sysprocesses, e.g. ASYNC_NETWORK_IO waits
Slow response:
Profiler, with a duration filter: what runs often and is lengthy?
most expensive query, weighted for how often used
open transaction with plan
weighted missing index
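A rough version of the "weighted missing index" check on SQL Server uses the missing-index DMVs; treat the output as hints to review, not as indexes to create blindly:

    -- Missing index suggestions, weighted by estimated impact.
    SELECT mid.statement AS table_name,
           mid.equality_columns,
           mid.inequality_columns,
           mid.included_columns,
           migs.user_seeks * migs.avg_total_user_cost
               * (migs.avg_user_impact / 100.0) AS estimated_benefit
    FROM   sys.dm_db_missing_index_details AS mid
    JOIN   sys.dm_db_missing_index_groups AS mig
             ON mig.index_handle = mid.index_handle
    JOIN   sys.dm_db_missing_index_group_stats AS migs
             ON migs.group_handle = mig.index_group_handle
    ORDER BY estimated_benefit DESC;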
Things you should have:
Backups
Tested restore of aforementioned backups
Regular index and statistic maintenance
Regular DBCC and integrity checks
Edit: After your update
Static analysis is best practices only: you can't optimise for usage. This is all you can do. This is marc_s' answer.
You can guess what the most common query may be, but you can't guess how much data will be written or how badly a query scales with more data
In many shops developers provide some support, either directly or as "3rd line"
If you've been given a DB for review by another team that you hand over to another team to deploy: that's odd.
If you're not interested in the runtime behavior of the database, e.g. what are the most frequently executed queries and those that consume the most time, you can only do a "static" analysis of the database structure itself. That has a lot less value, really, since you can only check for a number of key indicators of bad design - but you cannot really tell much about the "dynamics" of the system being used.
Things I would check for in a database that I get as a .bak file - without the ability to collect live and actual runtime performance statistics - would be:
normalization - is the table structure normalized to third normal form? (at least most of the time - there might be some exceptions)
do all tables have a primary key? ("if it doesn't have a primary key, it's not a table", after all)
For SQL Server: do all the tables have a good clustering index? A unique, narrow, static, and preferably ever-increasing clustered key - ideally an INT IDENTITY, and most definitely not a large compound index of many fields, no GUIDs and no large VARCHAR fields (see Kimberly Tripp's excellent blog posts on the topic for details)
are there any check and default constraints on the database tables?
are all the foreign key fields backed by a non-clustered index to speed up JOIN queries? (see the sketch after this list)
are there any other, obvious "deadly sins" in the database, e.g. overly complicated views, or really badly designed tables etc.
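As a small illustration of the clustering-key and foreign-key points above (tables and names are hypothetical):

    -- Narrow, static, ever-increasing clustered key on each table ...
    CREATE TABLE dbo.Customers (
        CustomerID   INT IDENTITY(1,1) NOT NULL,
        CustomerName NVARCHAR(200)     NOT NULL,
        CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerID)
    );

    CREATE TABLE dbo.Orders (
        OrderID    INT IDENTITY(1,1) NOT NULL,
        CustomerID INT               NOT NULL,
        OrderDate  DATETIME2         NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID),
        CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID)
            REFERENCES dbo.Customers (CustomerID)
    );

    -- ... and a non-clustered index backing the foreign key to speed up JOINs.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);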
But again: without actual runtime statistics, you're quite limited in what you can do from a "static analysis" point of view. The real optimization can only really happen when you have a workload from a regular day of operation, to see what queries are used frequently and put the most stress on your database --> use Mitch's checklist to check those points.
The most important thing to do is collect up-to-date statistics. Performance of a database depends on:
the schema;
the data in the database; and
the queries being executed.
Looking at any of those in isolation is far less useful than the whole.
Once you have collected the statistics, then you start identifying operations that are sub-par.
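If "statistics" here includes the optimizer's data-distribution statistics (one common reading), refreshing them is cheap; on SQL Server, for example, something like this is typical (table name hypothetical):

    -- Refresh out-of-date statistics across the whole database ...
    EXEC sp_updatestats;

    -- ... or target a specific table with a full scan for more accurate histograms.
    UPDATE STATISTICS dbo.Orders WITH FULLSCAN;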
For what it's worth, the vast majority of performance problems we've fixed have been fixed by adding indexes, adding extra columns and triggers to move the cost of calculations away from the select to the insert/update, or tactfully informing the users that their queries are, shall we say, less than optimal :-)
They're usually pleased that we can just give them an equivalent query that runs much faster.
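A hedged sketch of the "extra column plus trigger" idea on SQL Server, with a hypothetical schema; when the calculation only touches the row's own columns, a persisted computed column (ALTER TABLE ... ADD LineTotal AS Quantity * UnitPrice PERSISTED) does the same job with less code:

    -- Store the derived value at write time so SELECTs no longer pay for it.
    CREATE TABLE dbo.OrderLines (
        OrderLineID INT IDENTITY(1,1) PRIMARY KEY,
        Quantity    INT            NOT NULL,
        UnitPrice   DECIMAL(18, 2) NOT NULL,
        LineTotal   DECIMAL(18, 2) NULL   -- maintained by the trigger below
    );
    GO

    CREATE TRIGGER dbo.trg_OrderLines_LineTotal
    ON dbo.OrderLines
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE ol
        SET    ol.LineTotal = ol.Quantity * ol.UnitPrice
        FROM   dbo.OrderLines AS ol
        JOIN   inserted AS i ON i.OrderLineID = ol.OrderLineID;
    END;
    GO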
I am currently into a performance tuning exercise. The application is DB intensive with very little processing logic. The performance tuning is around the way DB calls are made and the DB itself.
We did the query tuning, we added the missing indexes, and we reduced or eliminated DB calls where possible. The application is performing very well and all is fine.
With a smaller data volume (say, up to 100,000 records), the performance is fantastic. My question is: what needs to be done to ensure such good performance at higher data volumes?
The data volumes are expected to reach 10 million records.
I can think of table and index partitioning, suggesting filesystems optimized for DB storage and periodic archiving to keep the number of rows in check. I would like to know what else could be done. Any tips/strategies/patterns would be very helpful.
Monitoring. Use some tools to monitor performance, and saturation of CPU, memory, and I/O. Make trend lines so you know where your next bottleneck will be before you get there.
Testing. Create mock data so you have 10 million rows on a testing server today. Benchmark the queries you have in your application and see how well they perform as the volume of data increases. You might be surprised at what breaks down first, or it may go exactly as predicted. The point is that you can find out.
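One quick way to get those 10 million mock rows, assuming SQL Server (table and columns hypothetical); cross-joining system views is a common trick for generating arbitrary row counts:

    -- Generate 10 million synthetic rows for load testing.
    CREATE TABLE dbo.TestOrders (
        OrderID    INT IDENTITY(1,1) PRIMARY KEY,
        CustomerID INT            NOT NULL,
        OrderDate  DATE           NOT NULL,
        Amount     DECIMAL(10, 2) NOT NULL
    );

    INSERT INTO dbo.TestOrders (CustomerID, OrderDate, Amount)
    SELECT TOP (10000000)
           ABS(CHECKSUM(NEWID())) % 100000,                           -- pseudo-random customer
           DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 3650), GETDATE()), -- random date in the last ~10 years
           CAST((ABS(CHECKSUM(NEWID())) % 100000) / 100.0 AS DECIMAL(10, 2))
    FROM   sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
    CROSS JOIN sys.all_objects AS c;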
Maintenance. Make sure your application and infrastructure support some downtime, because that's always necessary. You might have to defrag and rebuild your indexes. You might have to refactor some of the table structure. You might have to upgrade the server software or apply patches. To do this without interrupting continuous operation, you'll need some redundancy built in to the design.
Research. Find the best journals and blogs for the database brand you're using, and read them (e.g. http://www.mysqlperformanceblog.com if you use MySQL). You can ask good questions like the one you ask here, but also read what other people are asking, and what they're being advised to do about it. You can learn solutions to problems that you don't even have yet, so that when you have them, you'll have some strategies to employ.
Different databases need to be tuned in different ways. What RDBMS are you using?
Also, how do you know whether or not what you have done so far will result in poor performance with larger data sets? Have you tested your current optimisations with a large amount of test data?
When you did this, how did the performance change? If you are able to tune the database so that it performs with the data it has now, there's no reason to think that your methods won't work with a larger data set.
Depending on the RDBMS, the next type of solution is simple: get bigger, beefier hardware. More RAM, more disks, more CPUs.
You are on the right track:
1) Proper indexes
2) DBMS options tuning (memory caches, buffers, internal thread control and so on)
3) Query tuning (in particular, log slow queries and then tune/rewrite them - see the sketch after this list)
4) To tune your queries and indexes you may need to examine your queries' execution plans
5) Powerful dedicated server
6) Think about the queries which your client applications send to the database. Are they always necessary? Do you need all the data you ask for? Is it possible to cache some data?
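For items 3 and 4, if the database happens to be MySQL, the slow query log and EXPLAIN are the usual starting points (a sketch; the threshold and the query are illustrative):

    -- Log every statement that takes longer than one second.
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;

    -- Then inspect the execution plan of an offending query.
    EXPLAIN
    SELECT o.id, o.total
    FROM   orders AS o
    JOIN   customers AS c ON c.id = o.customer_id
    WHERE  c.region = 'EMEA';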
10 million records is probably too small to bother with partitioning. Typically, partitioning will only be interesting if your data volumes are an order of magnitude or so bigger than that.
Index tuning for a database with 100,000 rows will probably get you 99% of what you need with 10 million rows. Keep an eye out for table scans or index range scans on the large tables in the system. On smaller tables they are fine and in some cases even optimal.
Archiving old data may help but this is probably overkill for 10 million rows.
One possible optimisation is to move reporting off onto a separate server. This will reduce the burden on the server - reports are often quite anti-social when run on operational systems as the schema tends not to be well optimised for it.
You can use database replication to do this, or build a data mart for reporting. Replication is easier to implement, but the reports will be no more efficient than they were on the production system. Building a star-schema data mart will be more efficient for reporting, but incurs additional development work.
OK, dumb question, I know, but I see the nebulous comment 'a large database', as well as 'small' and 'medium', and I wonder just what that means. Can someone define what a small, medium and large database is, for us SQL neophytes?
There isn't a threshold where a small database becomes medium or a medium database becomes large. Generally, when I hear these terms, I think of particular orders of magnitude in terms of total records being stored.
Small: Fits in a spreadsheet.
Medium: Fits in memory on a commodity server.
Large: Fits in a commodity cloud offering.
Very large: Fits in a specialized environment; unusual storage, latency, or throughput characteristics.
As poster dkretz suggested, you could also think about it in terms of the properties each kind of database has. Categorizing it this way, I'd say:
Small: Performance is not a concern. Your queries run fine without making any special optimizations. You see only a marginal performance difference when using front-line enhancements like indexes.
Medium: Your database probably has one or more staff that are assigned part-time to its maintenance and care. These people pay attention to the database's health; their primary administrative responsibility is to prevent unacceptable performance problems and minimize downtime.
Large: Probably has dedicated staff member(s) whose job is to work on the database and improve performance, as well as make sure that application changes don't cause schema breakage over the lifetime of the database. Metrics about the health and status of the database are monitored closely. Significant expertise is required to understand and perform optimizations.
Very large: The database stores vast amounts of information that must be readily accessible. Performance optimizations are absolutely required to wring every last ounce of speed out of each query, and without them, the database would be much less usable or even impossible to use. The database may be using sophisticated or innovative replication or clustering techniques, pushing the boundaries of current technology.
Note that these are entirely subjective, and that someone may very well have a perfectly legitimate alternate definition of "large".
One way to figure it is by observing your test queries.
A small database is one where indexes don't matter.
A medium database is one where queries take longer than one second if you don't have an appropriate index in place.
A big database is one where queries often take hours to optimize, using a combination of query design, index modification, and many test cycles.
Large databases are ones that force you to stop using relational databases.
In other words, a normalized, relational database where all the indexes in the world can't help you meet your response time requirements because of the massive JOINs.
If you've ever had to abandon relational databases for something else, you're either a poor database developer, have no expert DBA, or have a very large database.
“Large database” is indeed a nebulous concept. There are already very different answers and opinions posted in response to this question. Some approaches to defining “small”, “medium” and “large” databases may make more sense than others, but then, at some point, I consider each definition right, true and valid.
Some definitions make more sense than others because they focus on different aspects of importance for the design, programming, use, maintenance and administration of a database, and these different aspects are what really matter for a usable database. It just happens that all these aspects are impacted by the nebulous concept of “database size”.
So, does this mean that it does not matter whether you are able to define if a particular database is big or not?
Certainly not. What it means is that you will apply the concept differently when evaluating different design/operational/administrative aspects of your database. It also means that the concept will always be somewhat nebulous.
As an example: Database Index strategy (an aspect of Database design) is impacted by record count for each table (a measure of “size”), by record size times record count (another measure of “size”), and by Query Vs. Creation/Update/Delete operations ratio (an aspect of Database usage).
Query response times are better if indexes are used for tables with a large number of records. Depending on the nature of your WHERE, ORDER BY and record-aggregation clauses, you may need several indexes for certain tables.
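For example (hypothetical table), a single composite index can serve both the filter and the sort, so the engine avoids a separate sort step:

    -- One composite index covers the WHERE clause and the ORDER BY.
    CREATE INDEX idx_orders_customer_date
        ON orders (customer_id, order_date);

    SELECT order_id, order_date, amount
    FROM   orders
    WHERE  customer_id = 42
    ORDER BY order_date DESC;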
Creation, update and delete operations are impacted negatively as the number of indexes on the affected table(s) increases. More indexes on an affected table mean more changes that the RDBMS must perform, spending more time and more resources to apply those changes.
Also, if your RDBMS spends more time applying those changes, then locks are held for longer, impacting the response times of other queries being sent to the system at the same time.
So, how do you balance the quantity and design of your indexes? How do you know if you need an additional index, and whether adding that index will introduce a big negative impact on query response times? Answer: you test and profile your database against a target load, as per your load/performance requirements, and analyze the profiling data to discover whether further optimizations/redesigns/indexes are needed.
Different index strategies are required for different ratios of query to creation/update/delete operations. If your database is under a heavy load of queries but is rarely updated, the overall application will perform better if you add every index that improves query response times. On the other hand, if your database is constantly being updated but there are no large query operations, then performance will be better if you use fewer indexes.
There are other aspects of course: Database Schema design, Storage Strategy, Network design, Backup strategy, Stored Procedures/Triggers/Etc. programming, Application Programming (against the Database), Etc. All these aspects are impacted differently by distinct concepts of “size” (record size, record count, index size, index count, schema design, storage size, etc.).
I'd like to have more time, as this topic is fascinating. I hope this small contribution serves as a starting point for you in this fascinating world of SQL.
You have to account for hardware advancement for this definition:
Small database: working set fits into the physical RAM of a single commodity server (about 16GB now)
Medium database: fits into a single or several (through RAID) commodity hard drives on a single machine (up to several TBs now)
Large database: data needs to be distributed across multiple commodity servers in order to fit (up to several PBs now).
According to wikipedia article on Very Large Database
A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although naturally this definition changes over time.
If you have a database that is large enough that you can't just "back it up" to put on a development or test box, you likely have a "large database".
I think something like Wikipedia or the US census data is a 'big' database. My personal address list or to-do list is a small database. A middle-sized database is something in between.
You could try to define the sizes by how many servers you need. A small database is a component of an application you run on your desktop, a mid-sized database would be a single MySQL (or whatever) server somewhere, and a large database is going to require multiple servers with some kind of replication/failover support.
Alternatively, consider the "size" of the database as the amount of time it takes to change the schema used to represent a domain of information. (In actual implementations, databases may contain multiple schemas and disparate domains at once.)
Days = "Small database."
Weeks = "Medium database."
Months = "Large database."
Years = "HUGE database."
With this heuristic, "size" is ultimately an aspect of the information stored and the rate at which that information can be fully transformed. An approach based on time also maintains some semblance of how-does-this-affect-design-decisions as the sheer amount of data / number of rows increases and the performance of technology and implementations increases.
A variation of the above is to consider the “size” based on the amount of time required for management and routine maintenance. As the amount of data increases, so does the time for tasks such as backups, rebuilds, and upgrades. Without significant investment, this may outpace the time available for such tasks.
Regardless, the key factor of “size” is time.