Why are numbers used as ID in databases (think primary key + AI) and pretty much everywhere instead of letters?
There are 10 digits available, while the English alphabet has as many as 26 letters.
Let's say that each letter/digit takes a spot. 98 takes two spots and 1202 takes four, etc. In four spots, you can store up to 10,000 IDs, but if you use letters instead, you could store as many as 456,976 IDs with the same number of spots. Even more if you used case sensitivity. That's almost 50 times more.
I do realize that this most likely doesn't matter for the regular user, but why aren't larger databases using letters instead of digits as IDs?
You are confusing characters with numeric values.
An ID column that uses an integer (say a 32-bit integer) as the data type will only take 4 bytes per row. It will also be a native value in memory and can be acted on natively in the CPU (as a binary representation).
This is not the same for characters - even if one assumes ASCII (8 bits per character) is used, the moment you go over 4 characters you are using more space. You also need to translate between the values in order to make valid comparisons.
Numbers pack better. You assume that because the numbers are displayed in decimal they are stored as decimal, but they're really binary. Optimized for computers :).
If you want to represent one of 26 letters, you need 5 binary digits (2^5 = 32). You lose 32 - 26 = 6 possible values per block of 5 bits.
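To make the storage and comparison cost concrete, here is a minimal C sketch (illustrative only; the exact sizes depend on the database's column types, but the principle is the same):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint32_t id_num = 4294967295u;       /* largest 32-bit ID: still only 4 bytes  */
        char     id_str[] = "4294967295";    /* the same value as text: 10 bytes + NUL */

        printf("integer key: %zu bytes\n", sizeof id_num);   /* 4  */
        printf("string  key: %zu bytes\n", sizeof id_str);   /* 11 */

        /* Comparing integer keys is a single native CPU operation... */
        int same_num = (id_num == 12345u);

        /* ...while comparing string keys walks the characters one by one. */
        int same_str = (strcmp(id_str, "12345") == 0);

        return same_num + same_str;          /* keep the compiler quiet */
    }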
There's no hard and fast rule that says you can't use alphanumeric fields as IDs in databases. People do it all the time.
As for why it's more common to use numbers...
Most database systems are designed with auto-increment ability on numbers. (Yeah I know, that's a chicken/egg scenario)
Numbers can/usually do take up fewer bytes of storage. (Yes, you can store large numbers and shorter strings to overcome this, but as a general rule...)
I was going to expand on this, but everyone else beat me to the punch with accurate descriptions of the difference between bytes needed to store an int compared to a varchar. It would be silly to add it now. ;-)
On every system I've worked with, sorting on numbers is different than sorting on strings:
The values 1, 12, 3, 2, 20 are sorted numerically as 1, 2, 3, 12, 20, but alphanumerically as 1, 12, 2, 20, 3 (a short C sketch of this difference follows this answer).
More computational power is needed to overcome the previous point, so numbers are simply more efficient to work with.
This is the answer to why most databases are designed with auto-incrementing numbers as opposed to auto-incrementing strings in the first bullet. Whether it's the chicken or the egg, I'll leave up to you.
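A quick C sketch of that sorting difference, using qsort with a numeric and a string comparator on the same values as above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char * const *)a, *(const char * const *)b);
    }

    int main(void)
    {
        int         nums[] = { 1, 12, 3, 2, 20 };
        const char *strs[] = { "1", "12", "3", "2", "20" };

        qsort(nums, 5, sizeof nums[0], cmp_int);
        qsort(strs, 5, sizeof strs[0], cmp_str);

        for (int i = 0; i < 5; i++) printf("%d ", nums[i]);  /* 1 2 3 12 20 */
        printf("\n");
        for (int i = 0; i < 5; i++) printf("%s ", strs[i]);  /* 1 12 2 20 3 */
        printf("\n");
        return 0;
    }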
Because computers only work with numbers. Even characters are treated by the computer as numbers.
Also, using character strings is much LESS efficient than numbers.
Any number smaller than 4,294,967,296 (2^32) can be stored in only 4 bytes, whereas even a 5-character alphabetic string (one byte per character, so 5 bytes) allows only 11,881,376 (26^5) possibilities.
Computers don't store a base-10 digit in one byte; each byte can actually hold 256 different possible values.
I would like to generate a nicely-mixed-up integer fingerprint of an arbitrary C string (s). Most C strings will consist of ASCII text characters:
I want very different fingerprints for similar strings, esp such similar strings as "ab" and "ba"
I want it to be difficult to invert back from the fingerprint to the string (well, my string is typically longer than 32 bits, which means that many strings would map into the same integer), which means again that I want similar strings to yield very different codes;
I want to use the 32 bits available to me efficiently in the integer result,
I want the function source to be small
I want the function to be fast.
One usage is security (but not encryption) related. I can ask a user for a text password, convert it into an integer for storage, and later test whether this integer is correct. (I know I could store strings, but I don't want to. Guessing a 32-bit integer correctly is impossible if my program can slow down incorrect attempts to the point where brute force cannot work faster than password guessing. Another use of this function is as the start of a hash index function (mod array length) into an array.)
Alas, I am probably reinventing the wheel here. Such functions have probably been written a million times, by people who are much more versed in cryptography. I don't need AES, of course, but something much more lightweight. The use is different.
My first thought was:
Mod 64 each character to take advantage of the ASCII text aspect. Now I have 6 bits; call this x.
I can place a 6-bit string into 5 locations in a 32-bit space, leaving 2 bits over.
Take the current string index position (0, 1, 2...), mod 5 it to determine where to start placing my x in the running integer result code, and XOR my x into this running-result integer.
Use the remaining 2 bits to increment a counter [mod 4 to prevent overflow] for each character processed (a rough sketch of this scheme follows).
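For concreteness, a rough C sketch of the scheme just described; the slot positions (multiples of 6) and putting the counter in the top two bits are my own reading of the steps above, not a fixed design:

    #include <stddef.h>
    #include <stdint.h>

    /* 6 bits per character, XORed into one of five 6-bit slots chosen by
     * (index mod 5); a 2-bit character counter ends up in the top 2 bits. */
    uint32_t fingerprint(const char *s)
    {
        uint32_t h = 0;
        uint32_t count = 0;                                     /* mod-4 counter        */

        for (size_t i = 0; s[i] != '\0'; i++) {
            uint32_t x = (uint32_t)(unsigned char)s[i] & 0x3F;  /* mod 64               */
            h ^= x << (6 * (i % 5));                            /* slot 0,6,12,18 or 24 */
            count = (count + 1) & 3;
        }
        return h ^ (count << 30);
    }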
Then I thought that bit operations may be computer-fast but take more source code. I can think of other choices: take each index position i and multiply it by an ASCII representation of each character [or the x from above], and call this y[i]. Now do the following:
Calculate the natural logarithm of the sum of the y's (or this sum plus the running result), and just pretend that the first 32 bits of this result [maybe leaving off the first few bits], which are really a double, are an integer representation. I can XOR each bitint(log(y[i])) into the running integer result.
Or do it even cheaper: just add the y's, and then do the logarithm with the 32-bit pick-off just once at the end. Alternatively, run a sum of the y's through srand as a seed and grab a rand.
There are probably a few other ways to do it, too. In sum, the function should map similar strings into very different integers, be short to code, and be very fast.
Any pointers?
A common method of generating a non-reversible digest or hash of a string is to generate a Cyclic Redundancy Checksum (CRC).
Source for CRC is widely available; in this case you should use a common CRC-32 such as the one used by Ethernet. Different CRCs work on the same principle, but use different polynomials. Do not be tempted to invent your own polynomial; the distribution is likely to be sub-optimal.
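As a point of reference, here is a compact bitwise CRC-32 sketch using the reflected Ethernet/zlib polynomial 0xEDB88320 (table-driven versions are faster; this just shows the shape of the computation):

    #include <stddef.h>
    #include <stdint.h>

    /* Minimal bitwise CRC-32 (reflected polynomial 0xEDB88320). */
    uint32_t crc32_bitwise(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint32_t crc = 0xFFFFFFFFu;

        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }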
What you're looking for is called a "hash". Two examples of hash functions I'm aware of that return short integers are MurmurHash and SipHash. MurmurHash, as I recall, is not designed to be a cryptographic hash, while SipHash, on the other hand, is indeed designed with security in mind, as stated on its homepage. MurmurHash has 2 versions that return a 32-bit and a 64-bit output. SipHash returns a 64-bit output.
I'm writing a program in which I need to normalise an 18-bit input to the range 0-9999. This is something I have never come across before.
I have searched the internet and correct me if I am wrong here, but is this as simple as converting the 18-bit binary (000000000000000000) input into a natural number and then dividing it by 1000?
Is there a different and more efficient method?
Thank you
No, what you want to do is multiply your input by 0.03814697265 (that is, 10000/2^18).
The reasoning is pretty simple: you take your range of inputs (0..2^18) and split it in 10000 "slices". Thus each slice will have a range of just over 26 (262144/10000 = 26.2144). Then if you divide your input from the original range by this slice width (or multiply it by 1/26.2144), you'll get your number in the 0..9999 range.
Edit: depending on your background, you may need to know that here I use ^ with the meaning of exponentiation. Might be moot since this question is tagged C and it has no first-class concept of exponentiation, but it's definitely not XOR!
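If floating point is unwelcome, the same scaling can be done in pure integer arithmetic; a minimal sketch, assuming the input really is limited to 18 bits:

    #include <stdint.h>
    #include <stdio.h>

    /* Map an 18-bit input (0..262143) onto 0..9999:
     * multiply by 10000, then divide by 2^18 via a right shift. */
    static uint32_t scale_18bit(uint32_t in)
    {
        return (uint32_t)(((uint64_t)in * 10000u) >> 18);
    }

    int main(void)
    {
        printf("%u %u %u\n", (unsigned)scale_18bit(0),
                             (unsigned)scale_18bit(131072),
                             (unsigned)scale_18bit(262143));
        /* prints: 0 5000 9999 */
        return 0;
    }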
I need to find the position (or index), say i, of an integer array A of size 100 such that A[i] = 0. 99 elements of array A are 1 and only one element is 0. I want the most efficient way of solving this problem (so no one-by-one element comparison).
Others have already answered the fundamental question - you will have to check all entries, or at least, up until the point where you find the zero. This would be a worst case of 99 comparisons. (Because if the first 99 are ones then you already know that the last entry must be the zero, so you don't need to check it)
The possible flaw in these answers is the assumption that you can only check one entry at a time.
In reality we would probably use direct memory access to compare several integers at once. (e.g. if your "integer" is 32 bits, then processors with SIMD instructions could compare 128 bits at once to see if any entry in a group of 4 values contains the zero - this would make your brute force scan just under 4 times faster. Obviously the smaller the integer, the more entries you could compare at once).
But that isn't the optimal solution. If you can dictate the storage of these values, then you could store the entire "array" as binary bits (0/1 values) in just 100 bits (the easiest would be to use two 64-bit integers (128 bits) and fill the spare 28 bits with 1's) and then you could do a "binary chop" to find the data.
Essentially a "binary chop" works by chopping the data in half. One half will be all 1's, and the other half will have the zero in it. So a single comparison allows you to reject half of the values at once. (You can do a single comparison because half of your array will fit into a 64-bit long, so you can just compare it to 0xffffffffffffffff to see if it is all 1's). You then repeat on the half that contains the zero, chopping it in two again and determining which half holds the zero... and so on. This will always find the zero value in 7 comparisons - much better than comparing all 100 elements individually.
This could be further optimised because once you get down to the level of one or two bytes you could simply look up the byte/word value in a precalculated look-up table to tell you which bit is the zero. This would bring the algorithm down to 4 comparisons and one look-up (in a 64kB table), or 5 comparisons and one look-up (in a 256-byte table).
So we're down to about 5 operations in the worst case.
But if you could dictate the storage of the array, you could just "store" the array by noting down the index of the zero entry. There is no need at all to store all the individual values. This would only take 1 byte of memory to store the state, and this byte would already contain the answer, giving you a cost of just 1 operation (reading the stored value).
You cannot do better than a linear scan - unless the data is sorted or you have some extra data on it. At the very least you need to read all the data, since you have no clue where this 0 is hiding.
If it is [sorted], just access the relevant [minimum] location.
Something tells me that the expected answer is "compare pairs":
while (a[i] == a[i+1]) i += 2;   /* skip pairs of equal (1,1) values      */
i = (a[i] == 0) ? i : i + 1;     /* the mismatched pair contains the zero */
Although it looks better than the obvious approach, it's still O(n).
Keep track of it as you insert to build the array. Then just access the stored value directly. O(1) with a very small set of constants.
Imagine 100 sea shells, under one is a pearl. There is no more information.
There is really no way to find it faster than trying to turn them all over. The computer can't do any better with the same knowledge. In other words, a linear scan is the best you can do unless you save the position of the zero earlier in the process and just use that.
More trivia than anything else, but if you happen to have a quantum computer this can be done faster than linear, in roughly O(sqrt(n)) queries:
Grover's algorithm
Here's my problem (I'm programming in C):
I have some huge text files containing DNA sequences (each file has something like 65 million rows and a size of about 4-5 GB). In these files there are many duplicates (I don't know how many yet, but there should be many millions of them) and I want to output a file with only distinct values. Each string has a quality value associated with it, so if, e.g., I have 5 equal strings with different quality values I'll keep the best one and discard the other 4.
Reducing memory requirements and improving speed efficiency as far as I can is VITAL.
My idea was to create a JudyHS array, using a hash function to convert the DNA sequence string (which is 76 letters long and uses 7 possible characters) into an integer, in order to reduce memory usage (4 or 8 bytes instead of 76 bytes on many millions of entries should be quite an achievement). This way I could use the integer as an index and store only the best quality value for that index. The problem is that I can't find a hash function that UNIQUELY defines such a long string and produces a value that can be stored inside an integer or even a long long!
My first idea for a hash function was something like the default string hash function in Java: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], but I could obtain a maximal value of 8.52*10^59... way too big.
What about doing the same thing and storing it in a double? Would the computation become a lot slower?
Please note that I'd like a way to UNIQUELY define a string, avoiding collisions (or at least they should be extremely rare, because I would have to access the disk at every collision, quite a costly operation...)
You have 7^76 possible DNA sequences and want to map them to 2^32 hashes without collisions? Not possible.
You need at least log2(7^76) ≈ 214 bits to do that, about 27 bytes.
If you can live with some collisions I would recommend sticking to CRC32 or MD5 instead of reinventing the wheel.
The "simple" way to get a collision-free hash function for N elements is to use a good mixing function (say, a cryptographic hash function) and to truncate the size, so that the hash results live in a space of size at least N2. Here, you have 65 million rows -- this fits on 26 bits (226 is close to 65 millions) so 52 bits "ought to be enough".
You can try using a fast cryptographic hash function, even a "broken" one since this is not a security-related problem. MD4, MD5, SHA-1... then truncate the result to the first (or last) 64 bits, store that in a 64-bit integer type. Chances are that you will not get any collision among your 65 million rows; and if you get some, they will be very rare.
For optimized C implementations of hash functions, look up sphlib. Use the provided sph_dec64le() function to "decode" a sequence of 8 bytes into a 64-bit unsigned integer value.
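If pulling in a hash library feels heavy, a different, simpler option (not the one recommended above, and with weaker mixing than a truncated MD5/SHA-1) is a 64-bit FNV-1a hash; a minimal sketch:

    #include <stddef.h>
    #include <stdint.h>

    /* 64-bit FNV-1a.  Not collision-free (impossible in 64 bits), but with
     * 65 million keys in a 2^64 space collisions should be very rare. */
    static uint64_t fnv1a_64(const char *s, size_t len)
    {
        uint64_t h = 14695981039346656037ull;   /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (unsigned char)s[i];
            h *= 1099511628211ull;              /* FNV prime        */
        }
        return h;
    }

    /* Usage: uint64_t key = fnv1a_64(sequence, 76); */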
I'm writing a utility to calculate π to a million digits after the decimal. On a 32- or 64-bit consumer desktop system, what is the most efficient way to store and work with such a large number accurate to the millionth digit?
clarification: The language would be C.
Forget floating point; you need bit strings that represent integers.
This takes a bit less than 1/2 megabyte per number (a million decimal digits is about 3.32 million bits, roughly 415 kB). "Efficient" can mean a number of things. Space-efficient? Time-efficient? Easy to program with?
Your question is tagged floating-point, but I'm quite sure you do not want floating point at all. The entire idea of floating point is that our data is only known to a few significant figures and even the famous constants of physics and chemistry are known precisely to only a handful or two of digits. So there it makes sense to keep a reasonable number of digits and then simply record the exponent.
But your task is quite different. You must account for every single bit. Given that, no floating point or decimal arithmetic package is going to work unless it's a template you can arbitrarily size, and then the exponent will be useless. So you may as well use integers.
What you really really need is a string of bits. This is simply an array of convenient types. I suggest <stdint.h> and simply using uint32_t[125000] (or the uint64_t equivalent) to get started; a small sketch of arithmetic on such an array appears at the end of this answer. This actually might be a great use of the more obscure constants from that header that pick out bit sizes that are fast on a given platform.
To be more specific we would need to know more about your goals. Is this for practice in a specific language? For some investigation into number theory? If the latter, why not just use a language that already supports Bignums, like Ruby?
Then the storage is someone else's problem. But if what you really want to do is implement a big-number package, then I might suggest using BCD (4-bit) strings or even ordinary ASCII 8-bit strings with printable digits, simply because things will be easier to write and debug, and maximum space and time efficiency may not matter so much.
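To make the limb-array idea concrete, a minimal sketch assuming a little-endian array of 32-bit limbs (the array length is just the figure suggested above):

    #include <stddef.h>
    #include <stdint.h>

    #define LIMBS 125000u   /* ~4 million bits: comfortably over a million decimal digits */

    /* r = a + b over arrays of 32-bit limbs, least-significant limb first.
     * Returns the carry out of the top limb. */
    static uint32_t big_add(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t sum = (uint64_t)a[i] + b[i] + carry;
            r[i]  = (uint32_t)sum;
            carry = sum >> 32;
        }
        return (uint32_t)carry;
    }

    /* Usage: uint32_t a[LIMBS], b[LIMBS], r[LIMBS]; ... big_add(r, a, b, LIMBS); */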
I'd recommend storing it as an array of short ints, one per digit, and then carefully write utility classes to add and subtract portions of the number. You'll end up moving from this array of ints to floats and back, but you need a 'perfect' way of storing the number - so use its exact representation. This isn't the most efficient way in terms of space, but a million ints isn't very big.
It's all in the way you use the representation. Decide how you're going to 'work with' this number, and write some good utility functions.
If you're willing to tolerate computing pi in hex instead of decimal, there's a very cute algorithm (the Bailey-Borwein-Plouffe formula) that allows you to compute a given hexadecimal digit without knowing the previous digits. This means, by extension, that you don't need to store (or be able to do computation with) million-digit numbers.
Of course, if you want to get the nth decimal digit, you will need to know all of the hex digits up to that precision in order to do the base conversion, so depending on your needs, this may not save you much (if anything) in the end.
Unless you're writing this purely for fun and/or learning, I'd recommend using a library such as the GNU Multiple Precision library (GMP). Look into the mpf_t data type and its associated functions for storing arbitrary-precision floating-point numbers (a minimal sketch follows after this answer).
If you are just doing this for fun/learning, then represent numbers as an array of chars, which each array element storing one decimal digit. You'll have to implement long addition, long multiplication, etc.
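A minimal GMP sketch of the mpf_t route, using sqrt(2) as a stand-in computation (the actual pi algorithm is up to you); compile with -lgmp:

    #include <gmp.h>

    int main(void)
    {
        /* Roughly 3.33 bits per decimal digit, so ~3.4 million bits
         * comfortably covers a million decimal digits. */
        mpf_set_default_prec(3400000);

        mpf_t x;
        mpf_init(x);
        mpf_sqrt_ui(x, 2);             /* sqrt(2) as a stand-in computation */

        gmp_printf("%.50Ff\n", x);     /* print the first 50 decimal places */

        mpf_clear(x);
        return 0;
    }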
Try PARI/GP; see Wikipedia.
You could store its decimal digits as text in a file and mmap it to an array.
I once worked on an application that used really large numbers (but didn't need good precision). What we did was store the numbers as logarithms, since you can store a pretty big number as a log10 within an int.
Think along these lines before resorting to bit stuffing or some complex bit representations.
I am not too good with complex math, but I reckon there are solutions which are elegant when storing numbers with millions of bits of precision.
IMO, any programmer of arbitrary-precision arithmetic needs an understanding of base conversion. It solves two problems anyway: being able to calculate pi in hex digits and convert the result to decimal representation, as well as finding the optimal container.
The dominant constraint is the number of correct bits in the multiplication instruction.
In JavaScript one always has 53 bits of accuracy, meaning that a Uint32Array with numbers of at most 26 bits can be processed natively (a waste of 6 bits per word).
On a 32-bit architecture with C/C++ one can easily get A*B mod 2^32, suggesting a basic element of 16 bits (those can be parallelized in many SIMD architectures, starting from MMX). Also, each 16-bit word can contain a 4-digit decimal number (wasting about 2.5 bits per word).
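A tiny C illustration of why 16 bits is a convenient base element on a 32-bit machine:

    #include <stdint.h>

    /* Two 16-bit "limbs" multiply into a full 32-bit product with no overflow
     * (0xFFFF * 0xFFFF = 0xFFFE0001), so plain 32-bit arithmetic suffices. */
    static uint32_t mul_limbs(uint16_t a, uint16_t b)
    {
        return (uint32_t)a * (uint32_t)b;
    }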