I have a table in SQL Server with large amount of data - around 40 million rows. The base structure is like this:
Title
type
length
Null distribution
Customer-Id
number
8
60%
Card-Serial
number
5
70%
-
-
-
-
-
-
-
-
Note
string-unicode
2000
40%
Both numeric columns are filled by numbers with specific length.
I have no idea which data type to choose to have a database in the smallest size and having good performance by indexing the customerId column. Refer to this Post if I choose CHAR(8), database consume 8 bytes per row even in null data.
I decided to use INT to reduce the database size and having good index, but null data will use 4 bytes per rows again. If I want to reduce this size, I can use VARCHAR(8), but I don't know, the system has good performance on setting index on this type or not. The main question is reducing database size is important or having good index on numeric type.
Thanks.
If it is a number - then by all means choose a numeric datatype!! Don't store your numbers as char(n) or varchar(n) !! That'll just cause you immeasurable grief and headaches later on.
The choice is pretty clear:
if you have whole numbers - use TINYINT, SMALLINT, INT or BIGINT - depending on the number range you need
if you need fractional numbers - use DECIMAL(p,s) for the best and most robust behaviour (no rounding errors like FLOAT or REAL)
Picking the most appropriate datatype is much more important than any micro-optimization for storage. Even with 40 million rows - that's still not a big issue, whether you use 4 or 8 bytes. Whether you use a numeric type vs. a string type - that makes a huge difference in usability and handling of your database!
I have a static database of ~60,000 rows. There is a certain column for which there are ~30,000 unique entries. Given that ratio (60,000 rows/30,000 unique entries in a certain column), is it worth creating a new table with those entries in it, and linking to it from the main table? Or is that going to be more trouble than it's worth?
To put the question in a more concrete way: Will I gain a lot more efficiency by separating out this field into it's own table?
** UPDATE **
We're talking about a VARCHAR(100) field, but in reality, I doubt any of the entries use that much space -- I could most likely trim it down to VARCHAR(50). Example entries: "The Gas Patch and Little Canada" and "Kora Temple Masonic Bldg. George Coombs"
If the field is a VARCHAR(255) that normally contains about 30 characters, and the alternative is to store a 4-byte integer in the main table and use a second table with a 4-byte integer and the VARCHAR(255), then you're looking at some space saving.
Old scheme:
T1: 30 bytes * 60 K entries = 1800 KiB.
New scheme:
T1: 4 bytes * 60 K entries = 240 KiB
T2: (4 + 30) bytes * 30 K entries = 1020 KiB
So, that's crudely 1800 - 1260 = 540 KiB space saving. If, as would be necessary, you build an index on the integer column in T2, you lose some more space. If the average length of the data is larger than 30 bytes, the space saving increases. If the ratio of repeated rows ever increases, the saving increases.
Whether the space saving is significant depends on your context. If you need half a megabyte more memory, you just got it — and you could squeeze more if you're sure you won't need to go above 65535 distinct entries by using 2-byte integers instead of 4 byte integers (120 + 960 KiB = 1080 KiB; saving 720 KiB). On the other hand, if you really won't notice the half megabyte in the multi-gigabyte storage that's available, then it becomes a more pragmatic problem. Maintaining two tables is harder work, but guarantees that the name is the same each time it is used. Maintaining one table means that you have to make sure that the pairs of names are handled correctly — or, more likely, you ignore the possibility and you end up without pairs where you should have pairs, or you end up with triplets where you should have doubletons.
Clearly, if the type that's repeated is a 4 byte integer, using two tables will save nothing; it will cost you space.
A lot, therefore, depends on what you've not told us. The type is one key issue. The other is the semantics behind the repetition.
Im currently developing an application that needs to store a 10 to 20 digit value into the database.
My question is, what datatype should i need to be using? This digit is used as an primary key, and therefore the performance of the DB is important for my accplication. In Java i use this digit as and BigDecimal.
Quote from the manual:
numeric: up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
http://www.postgresql.org/docs/current/static/datatype-numeric.html
131072 digits should cover your needs as far as I can tell.
Edit:
To answer the question about efficiency:
The first and most important question is: what kind of data is stored in that column and how do you use it?
If it's a number then use numeric.
If it's not a number use a varchar.
Never, ever store (real) numbers in character columns!
If you need to sort by that column you won't be satifisfied with what you get if you use a character datatype (e.g. 2 will be sorted after 10)
Coming back to the efficiency question. I assume this is mostly space efficiency you are concerned. You can calculate the space requirements for your values yourself.
The storage requirement for the numeric data type is documented as well:
The actual storage requirement is two bytes for each group of four decimal digits, plus five to eight bytes overhead
So for 20 digits this would be a maximum of 10 bytes plus the five to eight bytes overhead. So max. 18 bytes.
To store 20 digits in a varchar column you need 21 bytes.
So from a space "efficiency" point of view numeric is slightly better. But that should never influence your decision, because the choice of datatypes should be driven by the requirements of the column's content.
From a performance point of view I don't think there will be a big difference either.
Try BIGINT instead of NUMERIC.It should work.
http://www.postgresql.org/docs/current/static/datatype-numeric.html
Say, i have 10 billions of numbers stored in a file. How would i find the number that has already appeared once previously?
Well i can't just populate billions of number at a stretch in array and then keep a simple nested loop to check if the number has appeared previously.
How would you approach this problem?
Thanks in advance :)
I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, #Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512mb. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to take a bit "special" hash table:
Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte-integers, there are at most 2^32 different of them, i.e. one bit for each number. 2^32 bit = 512mb.
So now one has just to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which already is set, the number occured in the sequence already.
The important question is whether you want to solve this problem efficiently, or whether you want accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).
Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item iself, for example the least significant digits, let's say 20 digits (1M items).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.
How about:
Sort input by using some algorith which allows only portion of input to be in RAM. Examples are there
Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.
Finding duplicates
Noting that its a 32bit integer means that you're going to have a large number of duplicates, since a 32 bit int can only represent 4.3ish billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent whether all the possibilities are in 512 MB, which can easily fit into current RAM values. This as a start pretty easily allows you to recognise the fact if a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.
Another approach if the numbers will have an even amount of duplicates is to use a tightly packed array with 2-8 bits per value, taking about 1-4GB of RAM allowing you to count up to 255 occurrances of each number.
Its going to be a hack, but its doable.
You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?
You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.
If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.
If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).
If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.
I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur?Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.
Had to do this a long time ago.
What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory useage
p.s. another way to store them
1, -9, 12, 16, 20, -30, 52,
when i had no numbers lower than zero
After that i applied various algorithms (described by other posters) here on the reduced data set
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
unsigned x=0;
size_t *seen = malloc(1<<8*sizeof(unsigned)-3);
while (scanf("%u", &x)>0 && !BITOP(seen,x,&)) BITOP(seen,x,|=);
if (BITOP(seen,x,&)) printf("duplicate is %u\n", x);
else printf("no duplicate\n");
return 0;
}
This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters
i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basicly we are going to implement two simple functions
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]) if number of values > 1 than its a duplicate number
print the duplicates : number, line_index1, line_index2,...
again this approach can result in a very fast execution depending on how your MapReduce framework is set, highly scalable and very reliable, there are many diffrent implementations for MapReduce in many languages
there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour
good luck :)
Another more simple approach could be in using bloom filters
AdamT
Implement a BitArray such that ith index of this array will correspond to the numbers 8*i +1 to 8*(i+1) -1. ie first bit of ith number is 1 if we already had seen 8*i+1. Second bit of ith number is 1 if we already have seen 8*i + 2 and so on.
Initialize this bit array with size Integer.Max/8 and whenever you saw a number k, Set the k%8 bit of k/8 index as 1 if this bit is already 1 means you have seen this number already.
I understand there are multiple questions about this on SO, but I have yet to find a definitive answer of "yes, here's how..."
So here it is again: What are the possible ways to store an unsigned integer value (32-bit value or 32-bit bitmap) into a 4-byte field in SQL Server?
Here are ideas I have seen:
1) Use a -1*2^31 offset for all values
Disadvantages: need to perform math on the values before reading/writing/aggregating.
2) Use 4 tinyint fields
Disadvantages: need to concatenate values to perform any operations
3) Use binary(4)
Disadvantages: actually uses 4 + 2 bytes of space (Edit: varbinary(4) uses 4+2, binary(4) only uses 4)
Need to work in SqlBinary or cast to/from other types
IMO, you have the correct answers to storing 2^32 positive values in 4 bytes: either a standard int and you do the math or a binary(4) which, contrary to what you have said, will only consume 4 bytes of space. (Only varbinary will incur an extra 2 bytes of storage). A series of tinyint or smallint columns would be unjustifiably cumbersome IMO.
Of course there is another solution for storing 2^32 positive values but it takes eight bytes: a bigint with a check constraint. Given how cheap storage and memory is today, IMO, this is the simplest and cheapest solution given the programmatic hoops you will have to jump through with the other solutions, however clearly you have a reason for wanting to save the extra 4 bytes on each row.