Practical limits of SQL-Server database - sql-server

I am setting up a database that I anticipate will be quite large, used for calculations and data storage. It will be one table with maybe 10 fields, containing one primary key and two foreign keys to itself. I anticipate there will be about a billion records added daily.
Each record should be quite small, and I will primarily be doing inserts. With each insert I will need to do a simple update on one or two fields of a connected record. All queries should be relatively simple.
At what size will I start running into performance problems with sql-server? I've seen mention of vldb systems, but also heard they may be a real pain. Is there a threshold where I should start looking at that? Is there a better db than sql-server that is designed for this sort of thing?

When talking about transaction rates of over 10k/sec you shouldn't be asking advice on forums... This is close to TPC-C benchmark performance on 32 and 64 ways, which cost millions to tune up.
At what size will you be running into problems?
With a good data model and schema design, a properly tunned and with correct capacity planned server will not run into problems for 1 bil. records per day. The latest published SQL Server benchmarks are at about 1.2 mil tran/min. That is roughtly 16k transactions per second, at system priced at USD ~6 mil in 2005 (64 way Superdome). To achieve 10k tran/sec for your planned load you're not going to need a Superdome, but you are going to need a quite beefy system (at least 16 way probably) and specially a very very good I/O subsytem. When doing back of the envelope capacity planning one usualy considers about 1K tran/sec per HBA and 4 CPU cores to feed the HBA. And you're going to need quite a few database clients (application mid-tiers) just to feed 1 bil. records per day into the database. I'm not claiming that I did your capacity planning here, but I just wanted to give you a ballpark of what are we talking about. This is a multi-million dollars project, and something like this is not designed by asking advice on forums.

Unless you're talking large as in Google's index type of large, the Enterprise databases like SQL Server or Oracle will do just fine.
James Devlin over at Coding the Wheel summed it up nicely (though this is more of a comparison between free DB's like MySQL with Oracle/SQL Server
Nowadays I like to think of SQL Server and Oracle as the Death Stars of the relational database universe. Extremely powerful. Monolithic. Brilliant. Complex almost beyond the ability of a single human mind to understand. And a monumental waste of money except in those rare situations when you actually need to destroy a planet.
As far as performance goes, it all really depends on your indexing strategy. Inserts are really the bottleneck here, as the records need to be indexed as they come in, the more indexing you have, the longer inserts will take.
In the case of something like Google's index, read up on "Big Table", it's quit interesting how Google set it up to use clusters of servers to handle searches across enormous amounts of data in mere milliseconds.

It can be done, but given your hardware costs and plans get MS in to spec things out for you. It will be fraction of your HW costs.
Saying that, Paul Nielson blogged about 35k TPS (3 billion rows per day) 2 years ago. Comments worth reading too and reflect some of what Remus said

The size of the database itself does not create performance problem. Practical problems in database size come from operational/maintenance issues.
For example:
De-fragmenting and re-building indexes take too long.
Backups take too long or take up too much space.
Database restores cannot be performed quick enough in case of an outage.
Future changes to the database tables take too long to apply.
I would recommend designing/building in some sort of partitioning from the start. It can be SQL Server partitioning, application partitioning (e.g. one table per month), archiving (e.g. to a different database).
I believe that these problems occur in any database product.
In addition, be sure to make allowances for transaction log file sizes.

Related

non clustered index for 100 millions records in a table or pratitions

I have a table in IBM DB2 which contains more than 100 million records . Database was made 13 years ago and is not partitioned . Searching data and creating joins with this table takes huge amount of time .What should be proper approach to optimize searching and joins .
1. Using Non Clustered Index and searching via indexes .
2. Partitioning Table
3. or any other efficient approach.
I would like thanks in advance for your valuable time and efforts.
A "proper" approach is, of course, subjective. It's usually a trade-off, and the things most people trade off are the cost of implementing the change, the cost of maintaining the change, and the performance of the solution.
In all cases, I recommend gathering metrics, and agreeing your target - otherwise, you risk continuously optimizing beyond the point the business really needs. Typically, this means creating a representative test environment, with representative data. You then run the queries as they are today, and measure their performance. Finally, you agree (with whoever is paying the bills) what the minimum and optimum targets are. Once you reach that target - stop!
By far the cheapest solution is to optimize your queries, which often means creating indices. Depending on your queries, this can sometimes take just a few hours, and doesn't require any ongoing maintenance.
The next thing to do is to look at server configuration - tuning the memory allocation and disk strategy can do wonders, and making sure the database statistics are up to date. These tasks usually require 2 or 3 people to work together, and you may need to set up regular maintenance tasks.
If that doesn't do the job, consider improving the hardware. If your database server is as old as the database (13 years), it's quite possible that your mobile phone has better performance characteristics than your server. It's much cheaper to improve the hardware than it is to go to the next steps.
If hardware doesn't solve the problem, consider de-normalizing your data. For instance, if you are running lots of queries joining your large table to other large tables, consider creating a de-normalized table with all the data you need to fulfill that query. This is expensive, both from a development point of view (you have to work out how to maintain the denormalized data, how to make sure all the queries still work), and from a maintenance point of view - the additional complexity will make all enhancements and bug fixes harder.
If denormalizing doesn't work, partitioning is the next most expensive solution. This is a fairly drastic solution, because as far as I know, there's no "out of the box" solution to glue your front-end applications into the partitioning logic. So, pretty much every piece of code that needs to interact with the database needs to understand the partitioning logic, and a bug in any one place will break every other component that interacts with that data.

NoSQL or RDBMS for audit data

I know that similar questions were asked in the subject, but I still haven't seen anyone that completely contained all my requests.
I would start by saying that I only have experience in RDBMS's so I'm sorry if I get anything regarding NoSQL wrong.
I'm creating a database that would hold a large amount of audit logs (about 1TB).
I'm using it for:
Fast data writing (a massive amount of audit logs is written all the time)
Search - search over the audit data (search actions performed by a certain user, at a certain time or a certain action... the database should support searching any of the 'columns' very quickly)
Analytics & Reporting - Generate daily, weekly, monthly reports of the data (They are predefined at the moment.. if they are more dynamic, does it affect the solution I should choose?)
Reliability (support for fail-over or any similar feature), Scalability (If I grow above 1TB to 2TB, 10TB or 100TB - does any of the solutions can't support this amount of data?) and of course Performance (in the use cases I specified) are very important to me.
I know RDBMS and that would be my easy way of starting, but I'm really concerned that after a while, the DB would simply not keep up with the pace.
My question is should I pick an RDBMS or NoSQL solution and why? If a NoSQL solution, since they are so different, which of them do you think fits my needs?
Generally there isn't a right or wrong answer here.
Fast data writing, either solution will be ok, although you didn't say what volume per second you are storing. Both solutions have things to watch out for.
Search (very quick) over all columns. For smaller volumes, say few hundred Gb, then either solution will be Ok (assuming skilled people put it together). You didn't actually say how fast/often you search, so if it is many times per minute this consideration becomes more important. Fast search can often slow down ability to write high volumes quickly as indexes required for search need to be updated.
Audit records typically have a time component, so searching that is time constrained, eg within last 7 days, will significantly speed up search times compared to search all records.
Reporting. When you get up to 100Tb, you are going to need some real tricks, or a big budget, to get fast reporting. For static reporting, you will probably end up creating one program that generates multiple reports at once to save I/O. Dynamic reports will be the tricky one.
My opinion? Since you know RDBMS, I would start with that as a method and ship the solution. This buys you time to learn the real problems you will encounter (the no premature optimization that many on SO are keen on). During this initial timeframe you can start to select nosql solutions and learn them. I am assuming here that you want to run your own hardware/database, if you want to use cloud type solutions, then go to them straight away.

Best data store for billions of rows

I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).
The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net.
I'm a SQL server guy and I think SQL Server can do this, but with all the talk about BigTable, CouchDB, and other nosql solutions, it's sounding more and more like an alternative to a traditional RDBS may be best due to optimizations for distributed queries and scaling. I tried cassandra and the .net libraries don't currently compile or are all subject to change (along with cassandra itself).
I've looked into many nosql data stores available, but can't find one that meets my needs as a robust production-ready platform.
If you had to store 36 billion small, flat records so that they're accessible from .net, what would choose and why?
Storing ~3.5TB of data and inserting about 1K/sec 24x7, and also querying at a rate not specified, it is possible with SQL Server, but there are more questions:
what availability requirement you have for this? 99.999% uptime, or is 95% enough?
what reliability requirement you have? Does missing an insert cost you $1M?
what recoverability requirement you have? If you loose one day of data, does it matter?
what consistency requirement you have? Does a write need to be guaranteed to be visible on the next read?
If you need all these requirements I highlighted, the load you propose is going to cost millions in hardware and licensing on an relational system, any system, no matter what gimmicks you try (sharding, partitioning etc). A nosql system would, by their very definition, not meet all these requirements.
So obviously you have already relaxed some of these requirements. There is a nice visual guide comparing the nosql offerings based on the 'pick 2 out of 3' paradigm at Visual Guide to NoSQL Systems:
After OP comment update
With SQL Server this would e straight forward implementation:
one single table clustered (GUID, time) key. Yes, is going to get fragmented, but is fragmentation affect read-aheads and read-aheads are needed only for significant range scans. Since you only query for specific GUID and date range, fragmentation won't matter much. Yes, is a wide key, so non-leaf pages will have poor key density. Yes, it will lead to poor fill factor. And yes, page splits may occur. Despite these problems, given the requirements, is still the best clustered key choice.
partition the table by time so you can implement efficient deletion of the expired records, via an automatic sliding window. Augment this with an online index partition rebuild of the last month to eliminate the poor fill factor and fragmentation introduced by the GUID clustering.
enable page compression. Since the clustered key groups by GUID first, all records of a GUID will be next to each other, giving page compression a good chance to deploy dictionary compression.
you'll need a fast IO path for log file. You're interested in high throughput, not on low latency for a log to keep up with 1K inserts/sec, so stripping is a must.
Partitioning and page compression each require an Enterprise Edition SQL Server, they will not work on Standard Edition and both are quite important to meet the requirements.
As a side note, if the records come from a front-end Web servers farm, I would put Express on each web server and instead of INSERT on the back end, I would SEND the info to the back end, using a local connection/transaction on the Express co-located with the web server. This gives a much much better availability story to the solution.
So this is how I would do it in SQL Server. The good news is that the problems you'll face are well understood and solutions are known. that doesn't necessarily mean this is a better than what you could achieve with Cassandra, BigTable or Dynamo. I'll let someone more knowleageable in things no-sql-ish to argument their case.
Note that I never mentioned the programming model, .Net support and such. I honestly think they're irrelevant in large deployments. They make huge difference in the development process, but once deployed it doesn't matter how fast the development was, if the ORM overhead kills performance :)
Contrary to popular belief, NoSQL is not about performance, or even scalability. It's mainly about minimizing the so-called Object-Relational impedance mismatch, but is also about horizontal scalability vs. the more typical vertical scalability of an RDBMS.
For the simple requirement of fasts inserts and fast lookups, almost any database product will do. If you want to add relational data, or joins, or have any complex transactional logic or constraints you need to enforce, then you want a relational database. No NoSQL product can compare.
If you need schemaless data, you'd want to go with a document-oriented database such as MongoDB or CouchDB. The loose schema is the main draw of these; I personally like MongoDB and use it in a few custom reporting systems. I find it very useful when the data requirements are constantly changing.
The other main NoSQL option is distributed Key-Value Stores such as BigTable or Cassandra. These are especially useful if you want to scale your database across many machines running commodity hardware. They work fine on servers too, obviously, but don't take advantage of high-end hardware as well as SQL Server or Oracle or other database designed for vertical scaling, and obviously, they aren't relational and are no good for enforcing normalization or constraints. Also, as you've noticed, .NET support tends to be spotty at best.
All relational database products support partitioning of a limited sort. They are not as flexible as BigTable or other DKVS systems, they don't partition easily across hundreds of servers, but it really doesn't sound like that's what you're looking for. They are quite good at handling record counts in the billions, as long as you index and normalize the data properly, run the database on powerful hardware (especially SSDs if you can afford them), and partition across 2 or 3 or 5 physical disks if necessary.
If you meet the above criteria, if you're working in a corporate environment and have money to spend on decent hardware and database optimization, I'd stick with SQL Server for now. If you're pinching pennies and need to run this on low-end Amazon EC2 cloud computing hardware, you'd probably want to opt for Cassandra or Voldemort instead (assuming you can get either to work with .NET).
Very few people work at the multi-billion row set size, and most times that I see a request like this on stack overflow, the data is no where near the size it is being reported as.
36 billion, 3 billion per month, thats roughly 100 million per day, 4.16 million an hour, ~70k rows per minute, 1.1k rows a second coming into the system, in a sustained manner for 12 months, assuming no down time.
Those figures are not impossible by a long margin, i've done larger systems, but you want to double check that is really the quantities you mean - very few apps really have this quantity.
In terms of storing / retrieving and quite a critical aspect you have not mentioned is aging the older data - deletion is not free.
The normal technology is look at is partitioning, however, the lookup / retrieval being GUID based would result in a poor performance, assuming you have to get every matching value across the whole 12 month period. You could place a clustered indexes on the GUID column will get your associated data clusterd for read / write, but at those quantities and insertion speed, the fragmentation will be far too high to support, and it will fall on the floor.
I would also suggest that you are going to need a very decent hardware budget if this is a serious application with OLTP type response speeds, that is by some approximate guesses, assuming very few overheads indexing wise, about 2.7TB of data.
In the SQL Server camp, the only thing that you might want to look at is the new parrallel data warehouse edition (madison) which is designed more for sharding out data and running parallel queries against it to provide high speed against large datamarts.
"I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).
The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net."
I can tell you from experience that this is possible in SQL Server, because I have done it in early 2009 ... and it's still operation to this day and quite fast.
The table was partitioned in 256 partitions, keep in mind this was 2005 SQL version ... and we did exactly what you're saying, and that is to store bits of info by GUID and retrieve by GUID quickly.
When i left we had around 2-3 billion records, and data retrieval was still quite good (1-2 seconds if get through UI, or less if on RDBMS) even though the data retention policy was just about to be instantiated.
So, long story short, I took the 8th char (i.e. somewhere in the middle-ish) from the GUID string and SHA1 hashed it and cast as tiny int (0-255) and stored in appropriate partition and used same function call when getting the data back.
ping me if you need more info...
The following article discusses the import and use of a 16 billion row table in Microsoft SQL.
https://www.itprotoday.com/big-data/adventures-big-data-how-import-16-billion-rows-single-table.
From the article:
Here are some distilled tips from my experience:
The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some
point, it becomes too slow to be practical.
If you want to export your table to the smallest possible file, make it native format. This works best with tables containing
mostly numeric columns because they’re more compactly represented
in binary fields than character data. If all your data is
alphanumeric, you won’t gain much by exporting it in native format.
Not allowing nulls in the numeric fields can further compact the
data. If you allow a field to be nullable, the field’s binary
representation will contain a 1-byte prefix indicating how many
bytes of data will follow.
You can’t use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn’t able to find any
reference to this on MSDN or the Internet. If your table consists of
more than 2,147,483,647 records, you’ll have to export it in chunks
or write your own export routine.
Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original
table size before completion.
When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many
records to commit at a time. If you don’t include this parameter,
your entire file is imported as a single transaction, which
requires a lot of log space.
The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK
INSERT statement with the ORDER parameter.
There is an unusual fact that seems to overlooked.
"Basically after inserting 30Mil rows in a day, I need to fetch all the rows with the same GUID (maybe 20 rows) and be reasonably sure I'd get them all back"
Needing only 20 columns, a non-clustered index on the GUID will work just fine. You could cluster on another column for data dispersion across partitions.
I have a question regarding the data insertion: How is it being inserted?
Is this a bulk insert on a certain schedule (per min, per hour, etc)?
What source is this data being pulled from (flat files, OLTP, etc)?
I think these need to be answered to help understand one side of the equation.
Amazon Redshift is a great service. It was not available when the question was originally posted in 2010, but it is now a major player in 2017. It is a column based database, forked from Postgres, so standard SQL and Postgres connector libraries will work with it.
It is best used for reporting purposes, especially aggregation. The data from a single table is stored on different servers in Amazon's cloud, distributed by on the defined table distkeys, so you rely on distributed CPU power.
So SELECTs and especially aggregated SELECTs are lightning fast. Loading large data should be preferably done with the COPY command from Amazon S3 csv files. The drawbacks are that DELETEs and UPDATEs are slower than usual, but that is why Redshift in not primarily a transnational database, but more of a data warehouse platform.
You can try using Cassandra or HBase, though you would need to read up on how to design the column families as per your use case.
Cassandra provides its own query language but you need to use Java APIs of HBase to access the data directly.
If you need to use Hbase then I recommend querying the data with Apache Drill from Map-R which is an Open Source project. Drill's query language is SQL-Compliant(keywords in drill have the same meaning they would have in SQL).
With that many records per year you're eventually going to run out of space.
Why not filesystem storage like xfs which supports 2^64 files and using smaller boxes.
Regardless of how fancy people want to get or the amount of money one would end up spend getting a system with whatever database SQL NoSQL ..whichever these many records are usually made by electric companies and weather stations/providers like ministry of environment who control smaller stations throughout the country.
If you're doing something like storing pressure.. temperature..wind speed.. humidity etc...and guid is the location..you can still divide the data by year/month/day/hour.
Assuming you store 4 years of data per hard-drive.
You can then have it run on a smaller Nas with mirror where it would
also provide better read speeds and have multiple mount points..based on the year when it was created.
You can simply make a web-interface for searches
So dumping location1/2001/06/01//temperature and location1/2002/06/01//temperature would only dump the contents of hourly temperature for the 1st day of summer in those 2 years (24h*2) 48 small files vs searching a database with billions of records and possibly millions spent.
Simple way of looking at things.. 1.5 billion websites in the world with God knows how many pages each
If a company like Google had to spend millions per 3 billion searches to pay for super-computers for this they'd be broke.
Instead they have the power-bill...couple million crap computers.
And caffeine indexing...future-proof..keep adding more.
And yeah where indexing running off SQL makes sense then great
Building super-computers for crappy tasks with fixed things like weather...statistics and so on so techs can brag their systems crunches xtb in x seconds...waste of money that can be spent somewhere else..maybe that power-bill that won't run into the millions anytime soon by running something like 10 Nas servers.
Store records in plain binary files, one file per GUID, wouldn't get any faster than that.
You can use MongoDB and use the guid as the sharding key, this means that you can distribute your data over multiple machines but the data you want to select is only on one machine because you select by the sharding key.
Sharding in MongoDb is not yet production ready.

30 million records a day, SQL Server can't keep up, other kind of database system needed?

Some time ago I thought an new statistics system over, for our multi-million user website, to log and report user-actions for our customers.
The database-design is quite simple, containing one table, with a foreignId (200,000 different id's), a datetime field, an actionId (30 different id's), and two more fields containing some meta-information (just smallints). There are no constraints to other tables. Furthermore we have two indexes each containing 4 fields, which cannot be dropped, as users are getting timeouts when we are having smaller indexes. The foreignId is the most important field, as each and every query contains this field.
We chose to use SQL server, but after implementation doesn't a relational database seem like a perfect fit, as we cannot insert 30 million records a day (it's insert only, we don't do any updates) when also doing alot of random reads on the database; because the indexes cannot be updated fast enough. Ergo: we have a massive problem :-) We have temporarily solved the problem, yet
a relational database doesn't seem to be suited for this problem!
Would a database like BigTable be a better choice, and why? Or are there other, better choices when dealing with this kind of problems?
NB. At this point we use a single 8-core Xeon system with 4 GB memory and Win 2003 32-bit. RAID10 SCSI as far as I know. The index size is about 1.5x the table size.
You say that your system is capable of inserting 3000 records per second without indexes, but only about 100 with two additional non-clustered indexes. If 3k/s is the maximum throughput your I/O permits, adding two indexes should in theory reduces the throughput at about 1000-1500/sec. Instead you see a degradation 10 times worse. The proper solution and answer is 'It Dependts' and some serious troubleshooting and bottleneck identification would have to be carried out. With that in mind, if I was to venture a guess, I'd give two possible culprits:
A. Th additional non-clustered indexes distribute the writes of dirty pages into more allocation areas. The solution would be to place the the clustered index and each non-clustered index into its own filegroup and place the three filegroups each onto separate LUNs on the RAID.
B. The low selectivity of the non-clustered indexes creates high contention between reads and writes (key conflicts as well as %lockres% conflicts) resulting in long lock wait times for both inserts and selects. Possible solutions would be using SNAPSHOTs with read committed snapshot mode, but I must warn about the danger of adding lot of IO in the version store (ie. in tempdb) on system that may already be under high IO stress. A second solution is using database snapshots for reporting, they cause lower IO stress and they can be better controlled (no tempdb version store involved), but the reporting is no longer on real-time data.
I tend to believe B) as the likely cause, but I must again stress the need to proper investigation and proper root case analysis.
'RAID10' is not a very precise description.
How many spindles in the RAID 0 part? Are they short-striped?
How many LUNs?
Where is the database log located?
Where is the database located?
How many partitions?
Where is tempdb located?
As on the question whether relational databases are appropriate for something like this, yes, absolutely. There are many more factors to consider, recoverability, availability, toolset ecosystem, know-how expertise, ease of development, ease of deployment, ease of management and so on and so forth. Relational databases can easily handle your workload, they just need the proper tuning. 30 million inserts a day, 350 per second, is small change for a database server. But a 32bit 4GB RAM system hardly a database server, regardless the number of CPUs.
It sounds like you may be suffering from two particular problems. The first issue that you are hitting is that your indexes require rebuilding everytime you perform an insert - are you really trying to run live reports of a transactional server (this is usually considered a no-no)? Secondly, you may also be hitting issues with the server having to resize the database - check to ensure that you have allocated enough space and aren't relying on the database to do this for you.
Have you considered looking into something like indexed views in SQL Server? They are a good way to remove the indexing from the main table, and move it into a materialised view.
You could try making the table a partitioned one. This way the index updates will affect smaller sets of rows. Probably daily partitioning will be sufficient. If not, try partitioning by the hour!
You aren't providing enough information; I'm not certain why you say that a relational database seems like a bad fit, other than the fact that you're experiencing performance problems now. What sort of machine is the RDBMS running on? Given that you have foreign ID's, it seems that a relational database is exactly what's called for here. SQL Server should be able to handle 30 million inserts per day, assuming that it's running on sufficient hardware.
Replicating the database for reporting seems like the best route, given heavy traffic. However, a couple of things to try first...
Go with a single index, not two indexes. A clustered index is probably going to be a better choice than non-clustered. Fewer, wider indexes will generally perform better than more, narrower, indexes. And, as you say, it's the indexing that's killing your app.
You don't say what you're using for IDs, but if you're using GUIDs, you might want to change your keys over to bigints. Because GUIDs are random, they put a heavy burden on indexes, both in building indexes and in using them. Using a bigint identity column will keep the index running pretty much chronological, and if you're really interested in real-time access for queries on your recent data, your access pattern is much better suited for monotonically increasing keys.
Sybase IQ seems pretty good for the goal as our architects/DBAs indicated (as in, they explicitly move all our stats onto IQ stating that capability as the reason). I can not substantiate myself though - merely nod at the people in our company who generally know what they are talking about from past experience.
However, I'm wondering whether you MUST store all 30mm records? Would it not be better to store some pre-aggregated data?
Not sure about SQL server but in another database system I have used long ago, the ideal method for this type activity was to store the updates and then as a batch turn off the indexes, add the new records and then reindex. We did this once per night. I'm not sure if your reporting needs would be a fit for this type solution or even if it can be done in MS SQL, but I'd think it could.
You don't say how the inserts are managed. Are they batched or is each statistic written separately? Because inserting one thousand rows in a single operation would probably be way more efficient than inserting a single row in one thousand separate operations. You could still insert frequently enough to offer more-or-less real time reporting ;)

ensuring database performance as data volume increases

I am currently into a performance tuning exercise. The application is DB intensive with very little processing logic. The performance tuning is around the way DB calls are made and the DB itself.
We did the query tuning, We put the missing indexes, We reduced or eliminated DB calls where possible. The application is performing very well and all is fine.
With smaller data volume (say upto 100,000 records), the performance is fantastic. My Question is, what needs to be done to ensure such good performance at higher data volumes ?
The data volumes are expected to reach 10 million records.
I can think of table and index partitioning, suggesting filesystems optimized for DB storage and periodic archiving to keep the number of rows in check. I would like to know what else could be done. Any tips/strategies/patterns would be very helpful.
Monitoring. Use some tools to monitor performance, and saturation of CPU, memory, and I/O. Make trend lines so you know where your next bottleneck will be before you get there.
Testing. Create mock data so you have 10 million rows on a testing server today. Benchmark the queries you have in your application and see how well they perform as the volume of data increases. You might be surprised at what breaks down first, or it may go exactly as predicted. The point is that you can find out.
Maintenance. Make sure your application and infrastructure support some downtime, because that's always necessary. You might have to defrag and rebuild your indexes. You might have to refactor some of the table structure. You might have to upgrade the server software or apply patches. To do this without interrupting continuous operation, you'll need some redundancy built in to the design.
Research. Find the best journals and blogs for the database brand you're using, and read them (e.g. http://www.mysqlperformanceblog.com if you use MySQL). You can ask good questions like the one you ask here, but also read what other people are asking, and what they're being advised to do about it. You can learn solutions to problems that you don't even have yet, so that when you have them, you'll have some strategies to employ.
Different databases need to be tuned in different ways. What RDBMS are you using?
Also, how do you know whether or not what you have done so far will result in poor performance with larger data sets? Have you tested your current optimisations with a large amount of test data?
When you did this, how did the performance change? If you are able to tune the database so that it performs with the data it has now, there's no reason to think that your methods won't work with a larger data set.
Depending on the RDBMS, the next type of solution is simple: get bigger, beefier hardware. More RAM, more disks, more CPUs.
You are on the right way:
1) Proper indexes
2) DBMS options tuning (memory caches, buffers, internal threads control and so on)
3) Queries tuning (especially log slow queries and then tune/rewrite them)
4) To tune your queries and indexes you may need to research your queries execution plans
5) Poweful dedicated server
6) Think about queries which your client applications send to the database. Are they always necessary? Do you need all the data you ask for? Is it possible to cache some data?
10 million records is probably too small to bother with partitioning. Typically partitioning will only be interesting if your data volumes are an order or magnitude or so bigger than that.
Index tuning for a database with 100,000 rows will probably get you 99% of what you need with 10 million rows. Keep an eye out for table scans or index range scans on the large tables in the system. On smaller tables they are fine and in some cases even optimal.
Archiving old data may help but this is probably overkill for 10 million rows.
One possible optimisation is to move reporting off onto a separate server. This will reduce the burden on the server - reports are often quite anti-social when run on operational systems as the schema tends not to be well optimised for it.
You can use database replication to do this or make a data mart for reporting. Replication is easier to implement but the reports will be less efficient, no more efficient than they were on the production system. Building a star schema data mart will be more efficient for reporting but incur additional development work.

Resources