I'm trying to work out the best way to scale my site, and I have a question about how MSSQL will scale.
The way the table currently is:
cache_id - int - identifier
cache_name - nvarchar(256) - Used for lookup along with event_id
cache_event_id - int - Basically a way of grouping
cache_creation_date - datetime
cache_data - varbinary(MAX) - Data size will be from 2k to 5k
The data stored is a byte array that's basically a cached (compressed) instance of a page on my site.
The different ways of storing it that I see are:
1) One large table; it would contain tens of millions of records and easily grow to several gigabytes in size.
2) Multiple tables containing the data above, meaning each table would hold 200k to a million records.
The data in this table will be used to show web pages, so anything over 200ms to get a record is bad in my eyes (I know some people think a 1-2 second page load is OK, but I think that's slow and want to do my best to keep it lower).
So it boils down to: what is it that slows down SQL Server?
Is it the size of the table (disk space)?
Is it the number of rows?
At what point does it stop becoming cost effective to use multiple database servers?
If it's close to impossible to predict these things, I'll accept that as a reply too. I'm not a DBA, and I'm basically trying to design my DB so I don't have to redesign it later when it contains a huge amount of data.

This is all a 'rule of thumb' view:
Load (and therefore, to a considerable extent, performance) of a DB is largely a function of two factors: data volume and transaction load, with IMHO the second generally being more relevant.
With regard to data volume, one can hold many gigabytes of data and get acceptable access times by way of normalising, indexing, partitioning, fast IO systems, appropriate buffer cache sizes, etc. Many of these, e.g. normalisation, are issues that one considers at DB design time; others, e.g. adding or removing indexes and sizing the buffer cache, are considered during system tuning.
The transactional load is largely a function of code design and total number of users. Code design includes factors like getting transaction size right (small and fast is the general goal, but like most things it is possible to take it too far and have transactions that are too small to retain integrity, or so small that they themselves add load).
When scaling, I advise first scaling up (bigger, faster server), then out (multiple servers). The admin issues of a multiple-server setup are significant, and I suggest it is only worth considering for a site with OS, network and DBA skills and processes to match.
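To make the indexing point concrete for the table in the question, here is a minimal sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in for SQL Server, and the table name page_cache and index name ix_cache_lookup are made up for illustration. The idea it shows is that a lookup by event id and name should be served by a composite index rather than a scan, which matters far more than raw row count.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page_cache (
        cache_id            INTEGER PRIMARY KEY,
        cache_name          TEXT    NOT NULL,
        cache_event_id      INTEGER NOT NULL,
        cache_creation_date TEXT    NOT NULL,
        cache_data          BLOB    NOT NULL
    )
    """
)
# Composite index on the two lookup columns (hypothetical index name).
conn.execute(
    "CREATE INDEX ix_cache_lookup ON page_cache (cache_event_id, cache_name)"
)

# Sample rows: 1,000 pages of about 2 KB each, spread over 100 event ids.
sample_rows = [
    (f"page-{i}", i % 100, "2010-01-01", b"x" * 2048) for i in range(1000)
]
conn.executemany(
    "INSERT INTO page_cache "
    "(cache_name, cache_event_id, cache_creation_date, cache_data) "
    "VALUES (?, ?, ?, ?)",
    sample_rows,
)


def fetch_page(event_id, name):
    """Return the cached bytes for (event_id, name), or None if absent."""
    row = conn.execute(
        "SELECT cache_data FROM page_cache "
        "WHERE cache_event_id = ? AND cache_name = ?",
        (event_id, name),
    ).fetchone()
    return row[0] if row else None


# Ask the engine how it would run the lookup; the index should be named here.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT cache_data FROM page_cache "
    "WHERE cache_event_id = ? AND cache_name = ?",
    (42, "page-42"),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan_rows)
print(plan_text)
```

On SQL Server the equivalent check would be to look at the execution plan for the lookup query and confirm it does an index seek.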

Normalize and index.
How, we can't tell you, because you haven't told us what your table is trying to model or how you're trying to use it.
1 million rows is not at all uncommon. Again, we can't tell you much in the absence of context that only you can provide, but don't.

The only possible answer is to set it up, and be prepared for a long iterative process of learning things only you will know because only you will live in your domain. Any technical advice you see here will be naive and insufficiently informed until you have some practical experience to share.
Test every single one of your guesses, compare the results, and see what works. And keep looking for more testable ideas. (And don't be afraid to back out changes that end up not helping. It's a basic requirement to have any hope of sustained simplicity.)
And embrace the fact that your database design will evolve. It's not as fearsome as your comment suggests you think it is. It's much easier to change a database than the software that goes around it.


Choosing SQL Server data types for maximum speed

I'm designing a database that will need to be optimized for maximum speed.
All the database data is generated once from something I call an input database (which holds the data I'm editing, mainly some polylines, markers, etc for google maps).
So the database is not subject to editing, but it needs to hold as much data as it can for quickly displaying results to the user (routes across town, custom polylines, etc).
The question is: will choosing smaller data types, for example smallint over int, improve performance or hurt it? Space is not really a problem; after some quick calculations, the database will not exceed 200mb, and there will not be tables with more than 100.000 rows (average will be around 5.000).
I'm asking this because I read some articles around the internet and some say that smaller data types improve performance others say that it affects it because additional processing must be done. I'm aware that for smaller databases probably results are not noticeable, but I'm interested in every bit because I'm expecting many requests which will trigger a lot more queries.
The hosting environment is gonna be Windows Server 2008 R2 with SQL Server 2008 R2.
EDIT 1: Just to give you an example because I don't have a proper table structure yet:
I'm going to have a table which will hold public transportation lines (somewhere around 200), identified by a unique number in real life, and which is going to be referenced in all sorts of tables and on which all sorts of operations are going to be made. These referencing tables will hold the largest amount of data.
Because lines have unique numbers, I have thought of 3 examples of designs:
The PK is the line number of datatype: smallint
The PK is the line number of datatype: int
The PK is something different (identity for example) and the line number is stored in a different field.
Just for the sake of argument, because I used this on the 'input database' which is not subject to optimization, the PK is a GUID (16 bytes); if you like, you can make a comparison of how bad is this compared to others, if it really is
So keep in mind that the PK is going to be referenced in at least 15 tables, some of which will have over 50.000 rows (the rest averaging 5.000 as I said above) which are going to be subject to constant querying and manipulation, and I'm interested in every bit of speed that I can get.
I can detail this even more if you need. Thanks
EDIT 2: And another question related to this came to my mind, think it fits into this discussion:
Will I see any performance improvements in this specific scenario if I use native SQL queries from inside my .NET application rather than using LINQ to SQL? I know LINQ is strongly optimized and generates very good queries performance-wise, but still, sure worth asking. Thanks again.
Can you point to some articles that say that smaller data types = more processing? Keeping in mind that even with SSDs most workloads today are I/O-bound (or memory-bound) and not CPU-bound.
Particularly in cases where the PK is going to be referenced in many tables, it will be beneficial to use the smallest data type possible. In this case if that's a SMALLINT then that's what I would use (though you say there are about 200 values, so theoretically you could use TINYINT which is half the size and supports 0-255). Where you need to exercise caution is if you aren't 100% sure that there will always be ~200 values. Once you need 256 you're going to have to change the data type in all of the affected tables, and this is going to be a pain. So sometimes a trade-off is made between accommodating future growth and squeezing the absolute most performance today. If you don't know for certain that you will never exceed 255 or 32,000 values then I would probably just use an INT. Unless you also don't know that you won't ever exceed 2 billion values, in which case you would use BIGINT.
The difference between INT/SMALLINT/TINYINT is going to be more noticeable in disk space than in performance. (And if you're on Enterprise, the differences in both disk space and performance can be offset quite a bit using data compression - particularly while your INT values all fit within SMALLINT/TINYINT, though in the latter case it really will be negligible because the values are unique.) On the other hand, the difference between any of these and GUID is going to be much more noticeable in both performance and disk space. Marc gave some great links from Kimberly; I wrote this article in 2003 and while it's a little dated it does contain most of the salient points that are still relevant today.
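For a rough feel of the space side of this, here is a small sketch that multiplies the storage size of each key type by a row count. The 15 tables of 50,000 rows is a hypothetical round figure loosely based on the question, and the result counts raw key bytes only; it ignores row headers, index overhead, and compression.

```python
import struct
import uuid

# Storage size in bytes of one key value for each candidate type.
KEY_SIZES = {
    "TINYINT": struct.calcsize("<B"),
    "SMALLINT": struct.calcsize("<h"),
    "INT": struct.calcsize("<i"),
    "BIGINT": struct.calcsize("<q"),
    "GUID": len(uuid.uuid4().bytes),
}

# Hypothetical workload: 15 referencing tables with 50,000 rows each.
REFERENCING_ROWS = 15 * 50_000


def key_bytes(type_name, rows=REFERENCING_ROWS):
    """Raw bytes spent on the foreign key column alone, across all rows."""
    return KEY_SIZES[type_name] * rows


for type_name in KEY_SIZES:
    print(f"{type_name:<8} {key_bytes(type_name):>12,} bytes")
```

Even the GUID case comes to about 12 MB of raw key data at this scale, which is why the choice between the integer types is unlikely to be the deciding factor for a database this small.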
Another trade-off that sometimes needs to be considered (though not in your specific case, it seems) is whether values need to be unique across multiple systems. This is where you might need to sacrifice some performance in order to meet business requirements. In a lot of cases folks take the easy way and resign themselves to GUID. But there are other solutions too, such as identity ranges, a central custom sequence generator, and the new SEQUENCE object in SQL Server 2012. I wrote about SEQUENCE back in 2010 when the first public beta of SQL Server 2012 was released.
I think you will need to provide some more details about the table structure and sample queries that will be running against it. Based on the information that you have provided, I believe that the impact of choosing smaller data types will be just a couple of percent, and I would suggest paying more attention to the indexes that you will have. SQL Server does a good job of suggesting what indexes to create by providing you with execution plans for your queries and the tuning advisor tool.
One suggestion that I have is to incorporate a decimal datatype instead of using a combination of fields. For example, instead of having a table with Date (YYYYMMDD), Store (SSSS), and Item (IIII), I would recommend...YYYYMMDD.SSSSIIII. Especially when querying multiple tables with this same key combination, it dramatically improves processing time.

Database choice: High-write, low-read

I'm building a component for recording historical data. Initially I expect it to do about 30 writes/second, and less than 1 read/second.
The data will never be modified, only new data will be added. Reads are likely to be done with fresh records.
The demand is likely to increase rapidly, expecting around 80 writes/second in one year time.
I could choose to distribute my component and use a common database such as MySql, or I could go with a distributed database such as MongoDb. Either way, I'd like the database to handle writes very well.
The database must be free. Open source would be a plus :-)
Note: A record is plain text in variable size, typically 50 to 500 words.
Your question can be solved a few different ways, so let's break it down and look at the individual requirements you've laid out:
Writes - It sounds like the bulk of what you're doing is append only writes at a relatively low volume (80 writes/second). Just about any product on the market with a reasonable storage backend is going to be able to handle this. You're looking at 50-500 "words" of data being saved. I'm not sure what constitutes a word, but for the sake of argument let's assume that a word is an average of 8 characters, so your data is going to be some kind of metadata, a key/timestamp/whatever plus 400-4000 bytes of words. Barring implementation specific details of different RDBMSes, this is still pretty normal, we're probably writing at most (including record overhead) 4100 bytes per record. This maxes out at 328,000 bytes per second or, as I like to put it, not a lot of writing.
Deletes - You also need the ability to delete your data. There's not a lot I can say about that. Deletes are deletes.
Reading - Here's where things get tricky. You mention that it's mostly primary keys and reads are being done on fresh data. I'm not sure what either of these mean, but I don't think that it matters. If you're doing key only lookups (e.g. I want record 8675309), then life is good and you can use just about anything.
Joins - If you need the ability to write actual joins where the database handles them, you've written yourself out of the major non-relational database products.
Data size/Data life - This is where things get fun. You've estimated your writes at 80/second and I guess at 4100 bytes per record or 328,000 bytes per second. There are 86400 seconds in a day, which gives us 28,339,200,000 bytes. Terrifying! That's 27,675,000 KB, 27,026 MB, or roughly 26 GB / day. Even if you're keeping your data for 1 year, that's 9633 GB, or roughly 10TB of data. You can lease 1 TB of data from a cloud hosting provider for around $250 per month or buy it from a SAN vendor like EqualLogic for about $15,000.
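The sizing arithmetic is easy to rerun with your own numbers. A minimal sketch, using the 80 writes/second and the guessed 4100 bytes per record from above as inputs:

```python
WRITES_PER_SECOND = 80    # expected load after one year, from the question
BYTES_PER_RECORD = 4100   # guessed upper bound including record overhead
SECONDS_PER_DAY = 86_400

bytes_per_second = WRITES_PER_SECOND * BYTES_PER_RECORD
bytes_per_day = bytes_per_second * SECONDS_PER_DAY
kb_per_day = bytes_per_day / 1024
mb_per_day = kb_per_day / 1024
gb_per_day = mb_per_day / 1024
gb_per_year = gb_per_day * 365

print(f"{bytes_per_second:,} bytes/second")
print(f"{bytes_per_day:,} bytes/day")
print(f"{kb_per_day:,.0f} KB/day, {mb_per_day:,.0f} MB/day, {gb_per_day:.1f} GB/day")
print(f"{gb_per_year:,.0f} GB/year")
```

Bear in mind this is a worst case: it assumes every record is at the maximum size and that the peak write rate is sustained around the clock.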
Conclusion: I can only think of a few databases that couldn't handle this load. 10TB is getting a bit tricky and requires a bit of administration skill, and you might need to look at certain data lifecycle management techniques, but almost any RDBMS should be up to this task. Likewise, almost any non-relational/NoSQL database should be up to this task. In fact, almost any database of any sort should be up to the task.
If you (or your team members) already have skills in a particular product, just stick with that. If there's a specific product that excels in your problem domain, use that.
This isn't the type of problem that requires any kind of distributed magical unicorn powder.
OK, for MySQL I would advise you to use InnoDB without any indexes except on primary keys; even then, if you can skip them it would be good, in order to keep the input flow uninterrupted.
Indexes optimize reading but decrease write throughput.
You may also use PostgreSQL. There you need to skip indexes as well, but you won't have a choice of storage engine, and its write capabilities are also very strong.
The approach you want is actually used in some solutions, but with two DB servers, or at least two databases. The first receives a lot of new data (your case), while the second communicates with the first and stores the data in a well-structured database (with indexes, rules, etc). Then, when you need to read or make a snapshot of the data, you refer to the second server (or second database), where you can use transactions and so on.
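A rough sketch of that two-stage idea, squeezed into a single SQLite database through Python's sqlite3 module for brevity. The table names staging and history, the index name, and the flush_staging helper are all invented for the example; in practice the two sides would be separate databases or servers and the flush would run on a schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Write side: no secondary indexes, so appends stay as cheap as possible.
conn.execute("CREATE TABLE staging (recorded_at TEXT, body TEXT)")
# Read side: structured and indexed for queries.
conn.execute(
    "CREATE TABLE history (id INTEGER PRIMARY KEY, recorded_at TEXT, body TEXT)"
)
conn.execute("CREATE INDEX ix_history_recorded_at ON history (recorded_at)")


def record(recorded_at, body):
    """Append one record to the unindexed staging table."""
    conn.execute("INSERT INTO staging VALUES (?, ?)", (recorded_at, body))


def flush_staging():
    """Move all staged rows into the indexed table in one transaction."""
    with conn:
        moved = conn.execute(
            "INSERT INTO history (recorded_at, body) "
            "SELECT recorded_at, body FROM staging"
        ).rowcount
        conn.execute("DELETE FROM staging")
    return moved


for minute in range(3):
    record(f"2011-01-01T00:0{minute}:00", f"sample text {minute}")

moved = flush_staging()
staged_left = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
history_rows = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
print(moved, staged_left, history_rows)
```

At 30 to 80 writes/second this split is probably more machinery than the problem needs, but it shows where the index maintenance cost moves to.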
You should also take a look at Oracle Express (I think this was its name) and SQL Server Express Edition to get a more detailed picture. These last two have better performance, but also some limitations.

When is the size of the database call more expensive than the frequency of calls?

Can someone give me a relative idea of when it makes more sense to hit the database many times for small query results vs caching a large number of rows and querying that?
For example, if I have a query returning 2,000 results. And then I have additional queries on those results that take maybe 10-20 items, would it be better to cache the 2000 results or hit the database every time for each set of 10 or 20 results?
Other answers here are correct -- the RDBMS and your data are key factors. However, another key factor is how much time it will take to sort and/or index your data in memory versus in the database. We have one application where, for performance, we added code to grab about 10,000 records into an in-memory DataSet and then do subqueries on that. As it turns out, keeping that data up to date and selecting out subsets is actually slower than just leaving all the data in the database.
So my advice is: do it the simplest possible way first, then profile it and see if you need to optimize for performance.
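As a starting point for that kind of profiling, here is a minimal sketch that times both approaches on 2,000 rows with subsets of 20. It uses an in-memory SQLite database via Python's sqlite3 module, so there is no network round trip at all; the table, index, and function names are invented for the example, and the timings only mean something once you point it at your real database and data.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, category INTEGER, name TEXT)"
)
conn.execute("CREATE INDEX ix_items_category ON items (category)")
conn.executemany(
    "INSERT INTO items (id, category, name) VALUES (?, ?, ?)",
    [(i, i % 100, f"item-{i}") for i in range(2000)],
)


def from_database(category):
    """Hit the database for each subset of about 20 rows."""
    return conn.execute(
        "SELECT id, category, name FROM items WHERE category = ? ORDER BY id",
        (category,),
    ).fetchall()


# Pull all 2,000 rows once and keep them in memory.
cached_rows = conn.execute(
    "SELECT id, category, name FROM items ORDER BY id"
).fetchall()


def from_cache(category):
    """Filter the cached rows in application code."""
    return [row for row in cached_rows if row[1] == category]


def time_calls(fetch, repeats=200):
    """Total seconds taken by `repeats` subset lookups using `fetch`."""
    start = time.perf_counter()
    for i in range(repeats):
        fetch(i % 100)
    return time.perf_counter() - start


db_seconds = time_calls(from_database)
cache_seconds = time_calls(from_cache)
print(f"database: {db_seconds:.4f}s  cache: {cache_seconds:.4f}s")
```

Note that the cached version here never has to be refreshed, which is exactly the cost that made the in-memory approach slower in the application described above.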
It depends on a variety of things. I will list some points that come to mind:
If you have a .Net web app that is caching data in the client, you do not want to pull 2k rows.
If you have a web service, they are almost always better Chunky than Chatty because of the added overhead of XML on the transport.
In a fairly decently normalized and optimized database, there really should be very few times that you have to pull 2k rows out at a time unless you are doing reports.
If the underlying data is changing at a rapid pace, then you should really be careful caching it on the middle tier or the presentation layer because what you present will be out of date.
Reports (any DSS) will pull and chomp through much larger data sets, but since they are not interactive, we denormalize and let them have their fun.
In cases of cascading dropdowns and such, AJAX techniques will prove to be more efficient and effective.
I guess I'm not really giving you one answer to your question. "It depends" is the best I can do.
Unless there is a big performance problem (e.g. a highly latent db connection), I'd stick with leaving the data in the database and letting the db take care of things for you. A lot of things are done efficiently on the database level, for example
isolation levels (what happens if other transactions update the data you're caching)
fast access using indexes (the db may be quicker to access a few rows than you searching through your cached items, especially if that data already is in the db cache like in your scenario)
updates in your transaction to the cached data (do you want to deal with updating your cached data as well or do you "refresh" everything from the db)
There are a lot of potential issues you may run into if you do your own caching. You need to have a very good performance reason before starting to take care of all that complexity.
So, the short answer: It depends, but unless you have some good reasons, this smells like premature optimization to me.
in general, network round trip latency is several orders of magnitude greater than the capacity of a database to generate and feed data onto the network, and the capacity of a client box to consume it from a network connection.
But look at the width of your network bus ( Bits/sec ) and compare that to the average round trip time for a database call...
On 100baseT ethernet, for example, you have a data transfer rate of about 12.5 MBytes/sec. If your average round trip time is, say, 200 ms, then your network bus can deliver about 2.5 MBytes in each 200 ms round trip call.
If you're on gigabit ethernet, that number jumps to about 25 MBytes per round trip.
So if you split up a request for data into two round trips, that's 400 ms, and each query would have to return over 2.5 MBytes (or 25 MBytes for gigabit) before that would be faster.
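The same back-of-envelope calculation as a small sketch, so you can plug in your own link speed and measured round trip time. It uses nominal line rates (100 Mbit/s is 12.5 MBytes/sec, gigabit is 125 MBytes/sec) and ignores protocol overhead, so treat the results as upper bounds; the function name is made up for the example.

```python
def bytes_per_round_trip(link_bits_per_second, round_trip_seconds):
    """Bytes the link could carry in the time one round trip takes."""
    return link_bits_per_second / 8 * round_trip_seconds


ROUND_TRIP_SECONDS = 0.2  # the 200 ms example figure from above

fast_ethernet = bytes_per_round_trip(100_000_000, ROUND_TRIP_SECONDS)
gigabit = bytes_per_round_trip(1_000_000_000, ROUND_TRIP_SECONDS)

print(f"100 Mbit: {fast_ethernet / 1e6:.1f} MBytes per round trip")
print(f"gigabit:  {gigabit / 1e6:.1f} MBytes per round trip")
```

With the nominal rates this gives about 2.5 MBytes and 25 MBytes per 200 ms round trip, far more than 2,000 typical rows, which is the point: the extra round trip usually costs more than the extra data.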
This likely varies from RDBMS to RDBMS, but my experience has been that pulling in bulk is almost always better. After all, you're going to have to pull the 2000 records anyway, so you might as well do it all at once. And 2000 records isn't really a large amount, but that depends largely on what you're doing.
My advice is to profile and see what works best. RDBMSes can be tricky beasts performance-wise and caching can be just as tricky.
"I guess I'm not really giving you one answer to your question. "It depends" is the best I can do."
yes, "it depends". It depends on the volatility of the data that you are intending to cache, and it depends on the level of "accuracy" and reliability that you need for the responses that you generate from the data that you intend to cache.
If volatility on your "base" data is low, then any caching you do on those data has a higher probability of remaining valid and correct for a longer time.
If "caching-fault-tolerance" on the results you return to your users is zero percent, you have no option.
The type of data you're bringing back affects the decision as well. You don't want to be caching volatile data or data for potential updates that may get stale.

Advice on building a fast, distributed database

I'm currently working on a problem that involves querying a tremendous amount of data (billions of rows) and, being somewhat inexperienced with this type of thing, would love some clever advice.
The data/problem looks like this:
Each table has 2-5 key columns and 1 value column.
Every row has a unique combination of keys.
I need to be able to query by any subset of keys (i.e. key1='blah' and key4='bloo').
It would be nice to be able to quickly insert new rows (updating the value if the row already exists) but I'd be satisfied if I could do this slowly.
Currently I have this implemented in MySQL running on a single machine with separate indexes defined on each key, one index across all keys (unique) and one index combining the first and last keys (which is currently the most common query I'm making, but that could easily change). Unfortunately, this is quite slow (and the indexes end up taking ~10x the disk space, which is not a huge problem).
I happen to have a bevy of fast computers at my disposal (~40), which makes the incredible slowness of this single-machine database all the more annoying. I want to take advantage of all this power to make this database fast. I've considered building a distributed hash table, but that would make it hard to query for only a subset of the keys. It seems that something like BigTable / HBase would be a decent solution but I'm not yet convinced that a simpler solution doesn't exist.
Thanks very much, any help would be greatly appreciated!
I'd suggest you listen to this podcast for some excellent information on distributed databases.
To point out the obvious: you're probably disk bound.
At some point if you're doing randomish queries and your working set is sufficiently larger than RAM then you'll be limited by the small number of random IOPS a disk can do. You aren't going to be able to do better than a few tens of sub-queries per second per attached disk.
If you're up against that bottleneck, you might gain more by switching to an SSD, a larger RAID, or lots-of-RAM than you would by distributing the database among many computers (which would mostly just get you more of the last two resources)

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I get between 60 and 70 writes per millisecond. After, say, 5GB, the three databases I've tested (H2, Berkeley DB, Sybase SQL Anywhere) have REALLY slowed down, to under 5 writes per millisecond.
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries. Heavy fetch and light insert? Index everywhere a WHERE clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload? Benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
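A bare-bones harness for that kind of benchmark might look like the sketch below. It uses SQLite via Python's sqlite3 module as a stand-in for whichever database you are testing, with random integer records; the table, index, and function names are invented, and the random data is only a placeholder for a realistic distribution.

```python
import random
import sqlite3
import time


def time_inserts(row_count, with_index, seed=0):
    """Insert row_count random records; return (seconds, rows stored)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts INTEGER, a INTEGER, b INTEGER)")
    if with_index:
        conn.execute("CREATE INDEX ix_readings_ts ON readings (ts)")
    rows = [
        (rng.randrange(10**9), rng.randrange(1000), rng.randrange(1000))
        for _ in range(row_count)
    ]
    start = time.perf_counter()
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    elapsed = time.perf_counter() - start
    stored = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
    conn.close()
    return elapsed, stored


for with_index in (False, True):
    elapsed, stored = time_inserts(50_000, with_index)
    label = "with index" if with_index else "no index"
    print(f"{label:<10} {stored:,} rows in {elapsed:.3f}s")
```

To see the slowdown described in the question you would need an on-disk database and data volumes well beyond what fits in cache; an in-memory run like this only shows the per-row index maintenance cost.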
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ Your queries are only based on time ranges,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
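Under those two assumptions, a minimal sketch of the flat-file idea could look like this. The record layout (an 8-byte timestamp plus two 4-byte ints), the one-entry-per-minute index kept in memory, and the TimeLog class itself are all assumptions made for illustration; a real version would also persist the index and handle crashes.

```python
import bisect
import os
import struct
import tempfile

RECORD = struct.Struct("<qii")  # assumed layout: timestamp (s) + two ints
INDEX_INTERVAL = 60             # add one index entry per minute of data


class TimeLog:
    def __init__(self, path):
        self.path = path
        self.size = 0             # bytes written so far
        self.index_times = []     # timestamp of each indexed record
        self.index_offsets = []   # file offset of that record
        self._out = open(path, "wb")

    def append(self, ts, a, b):
        """Append one record; timestamps must be non-decreasing."""
        if not self.index_times or ts - self.index_times[-1] >= INDEX_INTERVAL:
            self.index_times.append(ts)
            self.index_offsets.append(self.size)
        self._out.write(RECORD.pack(ts, a, b))
        self.size += RECORD.size

    def query(self, start, end):
        """Return all records with start <= ts < end."""
        self._out.flush()
        # Find the last index entry at or before `start` and scan from there.
        i = bisect.bisect_right(self.index_times, start) - 1
        offset = self.index_offsets[i] if i >= 0 else 0
        results = []
        with open(self.path, "rb") as f:
            f.seek(offset)
            while chunk := f.read(RECORD.size):
                ts, a, b = RECORD.unpack(chunk)
                if ts >= end:
                    break
                if ts >= start:
                    results.append((ts, a, b))
        return results

    def close(self):
        self._out.close()


log = TimeLog(os.path.join(tempfile.mkdtemp(), "readings.log"))
for ts in range(0, 600, 10):      # ten minutes of readings, one every 10 s
    log.append(ts, ts * 2, ts * 3)
window = log.query(125, 200)
print([ts for ts, _, _ in window])
log.close()
```

Writes are plain sequential appends, and a range read costs one binary search over the in-memory index plus a scan of at most about a minute of extra records before the start of the range.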
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with @Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when the update is complete.
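In its simplest form that pattern is just three steps, sketched here with SQLite via Python's sqlite3 module as a stand-in; the corpus table, the ix_corpus_term index, and the bulk_load helper are hypothetical names for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (doc_id INTEGER, term TEXT)")
conn.execute("CREATE INDEX ix_corpus_term ON corpus (term)")


def bulk_load(rows):
    """Drop the index, load the batch, then rebuild the index once."""
    with conn:
        conn.execute("DROP INDEX IF EXISTS ix_corpus_term")
        conn.executemany("INSERT INTO corpus VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX ix_corpus_term ON corpus (term)")


bulk_load([(i, f"term-{i % 50}") for i in range(1000)])

row_total = conn.execute("SELECT COUNT(*) FROM corpus").fetchone()[0]
index_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
]
print(row_total, index_names)
```

The trade-off is that queries on the table run unindexed while the load is in progress, so this suits offline or batch windows rather than a continuous insert stream.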
The type of index applied also influences insertion performance. For example, with a SQL Server clustered index, the update I/O is used for data distribution as well as the index update, whereas nonclustered indexes are updated in separate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in B-trees. Assuming you aren't doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size.
