We want to use artificial keys in our data warehouse.
Is it a good idea to use bigint as the data type for the artificial keys? I think that because it's 64-bit, the CPU can process it very quickly. Or is this wrong?
I basically use the smallest type possible - in order to read less data when operations are performed.
So, choose the best type for your range. I guess INT will be too small for a data warehouse (remember when the YouTube counter was broken by a video hitting 2,147,483,647 views, the upper bound of the INT range).
How fast would CPU processing be if the key is BIGINT? Don't worry about such things. Processing speed depends on other factors: indexes, row size, execution plans, whether the engine can process rows in batches, and so on. How well the engine handles a particular data type should be your last worry, and even if BIGINT is slower, I doubt you can use INT given the business requirements.
While your computer will have 64 bit registers, much of the challenge in performance is about fitting things into memory. It isn't helpful to your overall performance to make objects larger than they need to be. 64 bit systems do 32 bit arithmetic just fine.
int is a pretty useful data type, and even if you stick to positive values (which I suggest you do as they compress better), you still have well over 2 billion values. If you're dealing with values like customers or products, that's never going to be too small.
If you're at a gigantic site and deal with very large numbers of transactions, you might well want to use bigint for those.
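To put rough numbers on the "don't make objects larger than they need to be" point, here is a minimal sketch of the raw bytes a key column costs at each width. The byte sizes are the standard SQL Server ones; the row count and the function name are made up for illustration, and the result ignores compression, row overhead and index copies.

```python
# Storage size in bytes of the SQL Server integer types.
KEY_BYTES = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}

def key_storage_bytes(type_name, rows, referencing_columns=1):
    """Raw bytes spent storing one key value per row, summed over
    every column (primary key plus foreign keys) that carries it."""
    return KEY_BYTES[type_name] * rows * referencing_columns

rows = 1_000_000_000  # hypothetical fact table size
for name in ("int", "bigint"):
    gib = key_storage_bytes(name, rows) / 2**30
    print(f"{name:>6}: {gib:.2f} GiB of key data")
```

The gap doubles again for every index or foreign key column that repeats the key, which is why the width matters more for memory and I/O than for CPU arithmetic.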
I'm designing a database that will need to be optimized for maximum speed.
All the database data is generated once from something I call an input database (which holds the data I'm editing, mainly some polylines, markers, etc for google maps).
So the database is not subject to editing, but it needs to hold as much data as it can for quickly displaying results to the user (routes across town, custom polylines, etc).
The question is: will choosing smaller data types, for example smallint over int, improve performance or hurt it? Space is not really a problem; after some quick calculations, the database will not exceed 200 MB, and there will not be tables with more than 100,000 rows (the average will be around 5,000).
I'm asking this because I read some articles around the internet: some say that smaller data types improve performance, while others say they hurt it because additional processing must be done. I'm aware that for smaller databases the results are probably not noticeable, but I'm interested in every bit because I'm expecting many requests, which will trigger a lot more queries.
The hosting environment is gonna be Windows Server 2008 R2 with SQL Server 2008 R2.
EDIT 1: Just to give you an example because I don't have a proper table structure yet:
I'm going to have a table which will hold public transportation lines (somewhere around 200), identified by a unique number in real life, and which is going to be referenced in all sorts of tables and on which all sorts of operations are going to be made. These referencing tables will hold the largest amount of data.
Because lines have unique numbers, I have thought of 3 examples of designs:
The PK is the line number of datatype: smallint
The PK is the line number of datatype: int
The PK is something different (identity for example) and the line number is stored in a different field.
Just for the sake of argument: in the 'input database', which is not subject to optimization, the PK is a GUID (16 bytes). If you like, you can compare how bad this is against the other options, if it really is bad.
So keep in mind that the PK is going to be referenced in at least 15 tables, some of which will have over 50,000 rows (the rest averaging 5,000 as I said above), which are going to be subject to constant querying and manipulation, and I'm interested in every bit of speed that I can get.
I can detail this even more if you need. Thanks
EDIT 2: And another question related to this came to my mind, think it fits into this discussion:
Will I see any performance improvements in this specific scenario if I use native SQL queries from inside my .NET application rather than using LINQ to SQL? I know LINQ is strongly optimized and generates very good queries performance-wise, but still, sure worth asking. Thanks again.
Can you point to some articles that say that smaller data types = more processing? Keeping in mind that even with SSDs most workloads today are I/O-bound (or memory-bound) and not CPU-bound.
Particularly in cases where the PK is going to be referenced in many tables, it will be beneficial to use the smallest data type possible. In this case if that's a SMALLINT then that's what I would use (though you say there are about 200 values, so theoretically you could use TINYINT, which is half the size and supports 0-255). Where you need to exercise caution is if you aren't 100% sure that there will always be ~200 values. Once you need 256 you're going to have to change the data type in all of the affected tables, and this is going to be a pain. So sometimes a trade-off is made between accommodating future growth and squeezing the absolute most performance today. If you don't know for certain that you will never exceed 255 or 32,767 values then I would probably just use an INT. And if you can't be sure that you will never exceed about 2 billion values, use BIGINT.
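The selection rule above can be written down as a small helper. This is only a sketch: the ranges are the documented SQL Server ones, but the function name and the growth_factor parameter (headroom for future growth) are my own illustration, and it assumes keys are never negative.

```python
# (type name, largest storable value) for SQL Server integer types, smallest first.
# TINYINT is unsigned (0-255); the others are signed.
INTEGER_TYPES = [
    ("TINYINT", 255),
    ("SMALLINT", 2**15 - 1),   # 32,767
    ("INT", 2**31 - 1),        # 2,147,483,647
    ("BIGINT", 2**63 - 1),
]

def smallest_type(max_expected, growth_factor=1):
    """Return the narrowest type that still fits max_expected * growth_factor.
    Assumes non-negative key values."""
    needed = max_expected * growth_factor
    for name, high in INTEGER_TYPES:
        if needed <= high:
            return name
    raise ValueError("value does not fit in any integer type")

print(smallest_type(200))                    # fits today
print(smallest_type(200, growth_factor=10))  # with room to grow
print(smallest_type(3_000_000_000))          # past the INT range
```

With no headroom, 200 lines fit in a TINYINT; asking for 10x headroom already pushes the answer to SMALLINT, which is the trade-off described above.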
The difference between INT/SMALLINT/TINYINT is going to be more noticeable in disk space than in performance. (And if you're on Enterprise, the differences in both disk space and performance can be offset quite a bit using data compression - particularly while your INT values all fit within SMALLINT/TINYINT, though in the latter case it really will be negligible because the values are unique.) On the other hand, the difference between any of these and GUID is going to be much more noticeable in both performance and disk space. Marc gave some great links from Kimberly; I wrote this article in 2003 and while it's a little dated it does contain most of the salient points that are still relevant today.
Another trade-off that sometimes needs to be considered (though not in your specific case, it seems) is whether values need to be unique across multiple systems. This is where you might need to sacrifice some performance in order to meet business requirements. In a lot of cases folks take the easy way and resign themselves to GUID. But there are other solutions too, such as identity ranges, a central custom sequence generator, and the new SEQUENCE object in SQL Server 2012. I wrote about SEQUENCE back in 2010 when the first public beta of SQL Server 2012 was released.
I think you will need to provide some more details about the table structure and sample queries that will run against it. Based on the information you have provided, I believe the impact of choosing smaller data types will be just a couple of percent, and I would suggest paying more attention to the indexes you will have. SQL Server does a good job of suggesting which indexes to create, through the execution plans for your queries and the tuning advisor tool.
One suggestion that I have is to incorporate a decimal datatype instead of using a combination of fields. For example, instead of having a table with Date (YYYYMMDD), Store (SSSS), and Item (IIII), I would recommend...YYYYMMDD.SSSSIIII. Especially when querying multiple tables with this same key combination, it dramatically improves processing time.
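To make the YYYYMMDD.SSSSIIII layout concrete, here is a rough sketch of packing and unpacking such a key. The function names are hypothetical, and it assumes store and item numbers never exceed four digits. It uses exact decimals on purpose: a 16-digit value like this is at the edge of what a binary float can represent, so the column would have to be an exact type (something like DECIMAL(16,8)) for the trick to be safe.

```python
from decimal import Decimal

def make_key(date_yyyymmdd, store, item):
    """Pack date, store and item into one YYYYMMDD.SSSSIIII decimal."""
    if not (0 <= store <= 9999 and 0 <= item <= 9999):
        raise ValueError("store and item must fit in four digits")
    fraction = Decimal(store * 10_000 + item).scaleb(-8)  # shift 8 places right
    return Decimal(date_yyyymmdd) + fraction

def split_key(key):
    """Recover (date, store, item) from a packed key."""
    date = int(key)                          # integer part is the date
    packed = int((key - date).scaleb(8))     # fractional part as SSSSIIII
    store, item = divmod(packed, 10_000)
    return date, store, item

key = make_key(20240131, 12, 34)
print(key)
print(split_key(key))
```

Note the cost of the approach: filtering by store or item alone now needs arithmetic on the key, so it mainly pays off when queries always join on the full combination.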
In all the applications I have made where a database is used, I typically store the calculated value along with the variables needed to calculate it. For example, if I have tonnage and cost I would multiply them to calculate the total. I could just recalculate the value every time it is needed; I was just wondering if there is a standard approach. Either way is fine with me, I just want to do what is most common.
If I store the calculated values it makes my domain classes a bit more complex but makes my controller logic cleaner; if I don't store them it is the other way around.
The calculations would not be extremely frequent, but may be moderately frequent, but math is cheap right?
The standard approach is not to store this kind of calculated value - it breaks normalization.
There are cases where you want to store calculated values: if it takes too long to recalculate, or you are running a data warehouse, etc. In your case, you want to stick to the normalization rules.
Having this calculated value violates normal form. Unless there is a reason to denormalize (usually performance constraints), you should make every attempt to normalize your tables. It will make your database much easier to maintain and improve, whereas denormalizing may lock you into a design that is difficult to alter and exposes your data to inconsistencies and redundancy.
In my experience, the most common thing to do is to a) store the calculated value, b) without any CHECK constraints in the database that would guarantee that the value is correct.
The right thing to do is either
don't store the result of the calculation
store the calculated value in a column that's validated with a CHECK constraint.
MySQL didn't enforce CHECK constraints before version 8.0.16 (it parsed them and ignored them). So on an older MySQL your options are
don't store the result of the calculation
switch to a dbms that supports CHECK constraints, such as PostgreSQL.
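For what the CHECK-validated option looks like in practice, here is a small sketch. It uses SQLite (through Python's built-in sqlite3 module) purely as a convenient stand-in for a DBMS that enforces CHECK constraints; the table and column names are made up around the tonnage/cost example from the question, and integer columns are used so the equality test is exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shipment (
        id           INTEGER PRIMARY KEY,
        tonnage      INTEGER NOT NULL,
        cost_per_ton INTEGER NOT NULL,
        total        INTEGER NOT NULL,
        -- the stored total must always match its inputs
        CHECK (total = tonnage * cost_per_ton)
    )
""")
conn.execute(
    "INSERT INTO shipment (tonnage, cost_per_ton, total) VALUES (10, 25, 250)"
)

def try_stale_update(connection):
    """Change an input without touching the stored total."""
    try:
        connection.execute("UPDATE shipment SET tonnage = 12 WHERE id = 1")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

print(try_stale_update(conn))
```

The database refuses the update that would leave the total out of date, so the stored value can never silently drift from tonnage times cost.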
It all depends on what resources are scarce in your environment. If you do pre-calculate the value, you'll save CPU time at the cost of increased network usage and DB storage space. These days, CPU time is generally much more abundant than network bandwidth and DB storage, so I'm going to guess that as long as the calculation isn't too complicated then pre-calculating the value is not worth it.
On the other hand, perhaps the value you're calculating takes a substantial amount of CPU. In this case, you may want to cache that value in the DB.
So, it depends on what you have and what you lack.
Simple math is relatively cheap; however, you need to weigh the additional storage cost against the performance saving when storing these values. Another thing you may want to consider is the effect this will have on data updates: you can't simply update the field value, you need to update the calculated value too.
I was recently on the OEIS (Online Encyclopedia of Integer Sequences), trying to look up a particular sequence I had on hand.
Now, this database is fairly large. The website states that if the 2006 (! 5 years old) edition were printed, it would occupy 750 volumes of text.
I'm sure this is the same sort of issue Google has to handle as well. But, they also have a distributed system where they take advantage of load balancing.
Neglecting load balancing however, how much time does it take to do a query compared to database size?
Or in other words, what is the time complexity of a query with respect to DB size?
Edit: To make things more specific, assume the input query is simply looking up a string of numbers such as:
1, 4, 9, 16, 25, 36, 49
It strongly depends on the query, structure of the database, contention, and so on. But in general most databases will find a way to use an index, and that index will either be some kind of tree structure (see http://en.wikipedia.org/wiki/B-tree for one option), in which case access time is proportional to log(n), or else a hash, in which case access time is O(1) on average (see http://en.wikipedia.org/wiki/Hash_function#Hash_tables for an explanation of how they work).
So the answer is typically O(1) or O(log(n)) depending on which type of data structure is used.
This may cause you to wonder why we don't always use hashes. There are multiple reasons. Hashes make it hard to retrieve ranges of values. If the hash function fails to distribute data well, it is possible for access time to become O(n). Hashes need resizing occasionally, which is potentially very expensive. And log(n) grows slowly enough that you can treat it as being reasonably close to constant across all practical data sets. (From 1,000 bytes to 1 petabyte it varies by a factor of 5.) And frequently the actively requested data shows some sort of locality, which trees do a better job of keeping in RAM. As a result trees are somewhat more commonly seen in practice. (Though hashes are by no means rare.)
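A toy illustration of the two access patterns, not how a real engine stores its indexes: a sorted list searched with binary search stands in for the tree, and a Python dict stands in for the hash index. The data is the sequence of squares from the question; all names are made up for the sketch.

```python
import bisect
import math

squares = [n * n for n in range(1, 100_001)]            # sorted keys
hash_index = {value: pos for pos, value in enumerate(squares)}

def tree_style_lookup(sorted_keys, key):
    """Binary search: about log2(n) comparisons, like walking down a balanced tree."""
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return None

def hash_style_lookup(index, key):
    """Dict lookup: O(1) on average, but no help for range queries."""
    return index.get(key)

def log_growth(small_n, large_n):
    """How much log(n) grows between two data set sizes."""
    return math.log(large_n) / math.log(small_n)

print(tree_style_lookup(squares, 49), hash_style_lookup(hash_index, 49))
print(round(log_growth(1_000, 10**15), 2))   # 1,000 versus a petabyte's worth
```

Both lookups find the same position; the last line is the "factor of 5" from the answer above, which is why log(n) is close enough to constant in practice.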
That depends on a number of factors including the database engine implementation, indexing strategy, specifics of the query, available hardware, database configuration, etc.
There is no way to answer such a general question.
A properly designed and implemented database with terabytes of data may actually outperform a badly designed little database (particularly one with no indexing that uses badly performing non-sargable queries and things such as correlated subqueries). This is why anyone expecting to have large amounts of data needs to hire an expert on database design for large databases to do the initial design, not later when the database is already large. You may also need to invest in the type of equipment you need to handle the size.
I am creating a little hobby database-driven, browser-based game and I stumbled across this problem: I store money owned by users as a 32-bit integer field (to be precise: two fields. One stores the money in the player's hand, the other the money stored in the bank). We all know that the maximum value that can be stored in 32 bits is 2^32-1 (or 2^31-1 if the field is signed).
I am absolutely sure that 95% of players will not be able to reach the upper limit - but on the other hand (and after doing some calculations today) good players will be able to accumulate that much.
Having that in mind I came up with the following ideas:
store money in 64 bits, which doubles the space used by each money field.
store money as a string and convert to/from long long at runtime.
change game mechanics so players will not be able to gain that amount of wealth.
I know that a reachable upper limit is rather limiting for some players, so for me the third option is the worst of the proposed ones.
Are there any other ways of dealing with this kind of problem? Which one would you go for?
Taking an example from the real world, why not have different types of coins, e.g. a column for a million units of the currency?
Changing to a larger datatype is likely the easiest solution, and considerations of disk space/memory aren't likely to be significant unless your game is huge in scale. Have 5,000 users playing your game? Changing from 32 bits to 64 bits will consume roughly 20 KB extra per money field (5,000 x 4 bytes), so about 40 KB for the hand and bank fields together. That's not enough to lose any sleep over.
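The arithmetic behind that estimate, as a quick sketch. The limits are the standard fixed-width integer bounds; the function name and the default of two money fields per player (hand and bank, as in the question) are just for illustration.

```python
INT32_MAX = 2**31 - 1    # signed 32-bit:   2,147,483,647
UINT32_MAX = 2**32 - 1   # unsigned 32-bit: 4,294,967,295
INT64_MAX = 2**63 - 1    # signed 64-bit:   about 9.2 quintillion

def extra_bytes(players, money_fields=2, old_bytes=4, new_bytes=8):
    """Additional raw storage when every money field is widened."""
    return players * money_fields * (new_bytes - old_bytes)

print(f"{INT32_MAX:,} -> {INT64_MAX:,}")
print(extra_bytes(5_000, money_fields=1), "bytes extra for one field")
print(extra_bytes(5_000), "bytes extra for hand + bank")
```

Even at a million players the two widened fields add only about 8 MB, while the ceiling moves from a few billion to a number no player will reach by accident.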
The best answer would likely come from someone familiar with how banks handle these types of situations, though their requirements may be far more complicated than what you need.
Memory shouldn't be a problem, depending on how many players you'll have simultaneously, but storing the value as a string will definitely use more disk space.
But seriously, 4 294 967 296 rupees/simoleons/furlongs? Who are they? Sim Gates?
Why not store money the way it should be stored, as a money data type? This assumes, of course, that you are using SQL Server. The money data type is 8 bytes wide, so it won't hit this limit, and as a fixed-point type it avoids the rounding issues of floating-point types.
When I'm creating indexes on tables, I usually set the Fill Factor from an educated guess about how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: high fill factor = quicker read, low = quicker write.
However it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor to 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors from 70% (very high write) to 95% (very high read).
It's a bit of an art form.
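To get a feel for the read-side cost of a given fill factor, here is a back-of-the-envelope sketch. It assumes SQL Server's 8 KB pages with 8,096 usable bytes and 2 bytes of slot overhead per row, and follows the shape of the documented index size estimate; the row size, row count and function names are made up, and real pages also depend on variable-length columns and later page splits.

```python
PAGE_DATA_BYTES = 8096  # usable bytes on an 8 KB page after the 96-byte header
SLOT_BYTES = 2          # row-offset array entry per row

def rows_per_page(row_bytes, fill_factor_pct=100):
    """Rows placed on each leaf page when the index is built or rebuilt."""
    if not 1 <= fill_factor_pct <= 100:
        raise ValueError("fill factor must be between 1 and 100")
    slot = row_bytes + SLOT_BYTES
    full = PAGE_DATA_BYTES // slot
    # rows' worth of space deliberately left empty for future inserts
    reserved = PAGE_DATA_BYTES * (100 - fill_factor_pct) // 100 // slot
    return full - reserved

def pages_needed(rows, row_bytes, fill_factor_pct=100):
    """Leaf pages required to hold the given number of rows."""
    per_page = rows_per_page(row_bytes, fill_factor_pct)
    return -(-rows // per_page)  # ceiling division

for ff in (100, 95, 70):
    print(ff, rows_per_page(100, ff), pages_needed(100_000, 100, ff))
```

Dropping from 100% to 70% means roughly 40% more pages to read for a scan of the same rows; whether the page splits you avoid on writes are worth that is what you would have to measure against a realistic workload.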
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
I would tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere, tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about when you know that everything else in your system is optimal. I don't know anyone that can say that.