Why is there a difference in precision range widths for decimal? - sql-server

As is evident from the MSDN description of decimal, certain precision ranges have the same number of storage bytes assigned to them.
What I don't understand is why the ranges differ in size. The range from 1 to 9 (5 storage bytes) has a width of 9, while the range from 10 to 19 (9 storage bytes) has a width of 10. Then the next range (13 storage bytes) has a width of 9 again, while the one after that has a width of 10 again.
Since the storage bytes increase by 4 every time, I would have expected all of the ranges to be the same width. Or maybe the first one would be smaller to reserve space for the sign or something, but from then on equal in width. But it goes from 9 to 10 to 9 to 10 again.
What's going on here? And if it existed, would 21 storage bytes have a precision range of 39-47, i.e. is the pattern 9-10-9-10-9-10...?

would 21 storage bytes have a precision range of 39-47
No. 2 ^ 160 = 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 - which has 49 decimal digits. So this hypothetical scenario would cater for a precision range of 39-48 (as a 20 byte integer would not be big enough to hold any 49 digit numbers larger than that)
The first byte is reserved for the sign.
01 is used for positive numbers; 00 for negative.
The remainder stores the value as an integer, i.e. 1.234 would be stored as the integer 1234 (or this multiplied by some power of 10, dependent on the declared scale).
The length of the integer is either 4, 8, 12 or 16 bytes depending on the declared precision. Some 10-digit integers can be stored in 4 bytes; however, the whole range of 10-digit values would overflow 4 bytes, so precision 10 needs to go to the next step up.
And so on.
2^32 = 4,294,967,296 (10 digits)
2^64 = 18,446,744,073,709,551,616 (20 digits)
2^96 = 79,228,162,514,264,337,593,543,950,336 (29 digits)
2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 (39 digits)
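The digit counts above can be turned into a small check of the 9-10-9-10 pattern. This is only a sketch of the arithmetic; the function name `max_full_precision` is made up for illustration and is not part of SQL Server.

```python
def max_full_precision(int_bytes: int) -> int:
    """Largest digit count d such that EVERY d-digit integer fits
    in an unsigned integer of int_bytes bytes."""
    limit = 2 ** (8 * int_bytes)   # first value that no longer fits
    # If limit has D decimal digits, all (D - 1)-digit numbers are below it,
    # but some D-digit numbers are not.
    return len(str(limit)) - 1


for n in (4, 8, 12, 16, 20):
    # n integer bytes plus 1 sign byte gives the documented storage size
    print(n + 1, "storage bytes -> precision up to", max_full_precision(n))
```

The widths alternate because each extra 4 bytes adds about 9.63 decimal digits (32 × log10(2)), and the fractional parts accumulate.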
You need to use DBCC PAGE to see this; casting the column as binary does not give you the storage representation. Or use a utility like SQL Server Internals Viewer.
CREATE TABLE T(
A DECIMAL( 9,0),
B DECIMAL(19,0),
C DECIMAL(28,0) ,
D DECIMAL(38,0)
);
INSERT INTO T VALUES
(999999999, 9999999999999999999, 9999999999999999999999999999, 99999999999999999999999999999999999999),
(-999999999, -9999999999999999999, -9999999999999999999999999999, -99999999999999999999999999999999999999);
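Shows the first row stored as [screenshot of the DBCC PAGE output; image not preserved]
And the second as [screenshot of the DBCC PAGE output; image not preserved]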
Note that the values after the sign byte are byte reversed. 0x3B9AC9FF = 999999999
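As a rough illustration of the layout described in this answer (one sign byte, 01 for positive and 00 for negative, followed by the integer in byte-reversed order), the bytes can be modelled as below. This is a reconstruction from the description only, not actual DBCC PAGE output, and `decimal_storage_bytes` is a hypothetical helper name.

```python
def decimal_storage_bytes(value: int, int_bytes: int) -> bytes:
    """Model of the described layout: sign byte, then little-endian magnitude."""
    sign = b"\x01" if value >= 0 else b"\x00"   # 01 = positive, 00 = negative
    return sign + abs(value).to_bytes(int_bytes, "little")


# DECIMAL(9,0) uses a 4-byte integer after the sign byte
print(decimal_storage_bytes(999999999, 4).hex())
print(decimal_storage_bytes(-999999999, 4).hex())
```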

Related

Can I sum integer input from terminal without saving the input as a variable?

I'm trying to write code for the digital root of an extremely big number and can't save it as a variable. Is it possible to do without it?
What you're looking to do is to repeatedly add the digits of a number until you're left with a single digit number, i.e. given 123456, you want 1 + 2 + 3 + 4 + 5 + 6 = 21 ==> 2 + 1 = 3
For a number with up to 50 million digits, the sum of those digits will be no more than 500 million which is well within the range of a 32-bit int.
Start by reading the large number as a string. Then iterate over each character in the string. For each character, verify that it's a character digit, i.e. between '0' and '9'. Convert that character to the appropriate number, then add that number to the sum.
Once you've done that, you've got the first-level sum stored in an int. Now you can loop through the digits of that number using x % 10 to get the lowest digit and x / 10 to shift over the remaining digits. Once you've exhausted the digits, repeat the process until you're left with a value less than 10.
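The two steps above might look like the following sketch. It assumes the digits arrive as any iterable of characters (a string, or characters read one at a time from the terminal), and it rejects non-digit characters; both choices are assumptions, not something the question specifies.

```python
def digital_root(chars) -> int:
    """Sum the decimal digits of a huge number, then reduce to one digit."""
    total = 0
    for ch in chars:
        if not ("0" <= ch <= "9"):          # verify it is a digit character
            raise ValueError(f"not a digit: {ch!r}")
        total += ord(ch) - ord("0")         # convert character to its value

    # Repeat the digit sum on the (now small) integer until one digit remains.
    while total >= 10:
        digit_sum = 0
        while total > 0:
            digit_sum += total % 10         # lowest digit
            total //= 10                    # shift the remaining digits
        total = digit_sum
    return total


print(digital_root("123456"))
```

Because only the running sum is kept, the number itself never has to be stored if the characters are streamed.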

Integer compression method

How can I compress a row of integers into something shorter ?
Like:
Input: '1 2 4 5 3 5 2 3 1 2 3 4' -> Algorithm -> Output: 'X Y Z'
and can get it back the other way around? ('X Y Z' -> '1 2 4 5 3 5 2 3 1 2 3 4')
Note: the input will only contain numbers between 1 and 5, and the string will contain 10-16 numbers in total.
Is there any way I can compress it to 3-5 numbers?
Here is one way. First, subtract one from each of your little numbers. For your example input that results in
0 1 3 4 2 4 1 2 0 1 2 3
Now treat that as the base-5 representation of an integer. (You can choose either most significant digit first or last.) Calculate the number in binary that means the same thing. Now you have a single integer that "compressed" your string of little numbers. Since you have shown no code of your own, I'll just stop here. You should be able to implement this easily.
Since you will have at most 16 little numbers, the resulting value from that algorithm will always be less than 5^16, which is 152,587,890,625. This fits into 38 bits. If you need to store smaller numbers than that, split your resulting value into digits of another, larger number base, such as 2^16 or 2^32. The former would result in 3 numbers, the latter in 2.
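A minimal sketch of this base-5 packing follows. The names `pack`, `unpack` and `to_words` are invented for the example, the most significant digit is taken first, and the count of numbers is assumed to be known to the decoder.

```python
def pack(numbers):
    """Treat numbers in 1..5 (minus one) as base-5 digits, most significant first."""
    value = 0
    for n in numbers:
        value = value * 5 + (n - 1)
    return value


def unpack(value, count):
    """Inverse of pack; count is the number of original numbers."""
    digits = []
    for _ in range(count):
        value, d = divmod(value, 5)
        digits.append(d + 1)
    return digits[::-1]


def to_words(value, bits=16):
    """Split value into digits of base 2**bits, most significant first."""
    words = []
    while True:
        words.append(value & ((1 << bits) - 1))
        value >>= bits
        if value == 0:
            return words[::-1]


row = [1, 2, 4, 5, 3, 5, 2, 3, 1, 2, 3, 4]
packed = pack(row)
print(to_words(packed), unpack(packed, len(row)) == row)
```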
@SergGr points out in a comment that this method does not show the number of integers encoded. If that is not stored separately, that can be a problem, since the method does not distinguish between leading zeros and coded zeros. There are several ways to handle that, if you need the number of integers included in the compression. You could require the most significant digit to be 1 (first or last depends on where the most significant number is). This increases the number of bits by one, so you now may need 39 bits.
Here is a toy example of variable-length encoding. Assume we want to encode two strings: 1 2 3 and 1 2 3 0 0. How will the results differ? Let's consider the two base-5 numbers 321 and 00321. They represent the same value, but let's still convert them into base 2 while preserving the padding.
1 + 2*5 + 3*5^2 = 86 dec = 1010110 bin
1 + 2*5 + 3*5^2 + 0*5^3 + 0*5^4 = 000001010110 bin
The additional 0s in the second line are there because the biggest 5-digit base-5 number, 44444, has the base-2 representation 110000110100, so the binary representation of the number is padded to that same size.
Note that there is no need to pad the first line because the biggest 3-digit base-5 number 444 has a base-2 representation of 1111100 i.e. of the same length. For an initial string 3 2 1 some padding will be required in this case as well, so padding might be required even if the top digits are not 0.
Now let's add a most significant 1 bit to the binary representations, and those will be our encoded values
1 2 3 => 11010110 binary = 214 dec
1 2 3 0 0 => 1000001010110 binary = 4182 dec
There are many ways to decode those values back. One of the simplest (but not the most efficient) is to first calculate the number of base-5 digits by calculating floor(log5(encoded)) and then remove the top bit and fill the digits one by one using mod 5 and divide by 5 operations.
Obviously such encoding of variable-length always adds exactly 1 bit of overhead.
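The toy example can be reproduced with the sketch below. It follows the convention of the worked lines above (base-5 digits taken least significant first, binary padded to the width of the largest n-digit base-5 number, then a leading 1 bit); `encode` and `decode` are illustrative names only.

```python
def encode(digits):
    """digits: base-5 digits (0..4), least significant first."""
    n = len(digits)
    value = sum(d * 5**i for i, d in enumerate(digits))
    width = (5**n - 1).bit_length()      # bits needed for the largest n-digit value
    return (1 << width) | value          # marker bit above the padded value


def decode(encoded):
    n = 0
    while 5 ** (n + 1) <= encoded:       # n = floor(log5(encoded)), integer-only
        n += 1
    width = (5**n - 1).bit_length()
    value = encoded - (1 << width)       # remove the top marker bit
    digits = []
    for _ in range(n):
        value, d = divmod(value, 5)
        digits.append(d)
    return digits


print(encode([1, 2, 3]), encode([1, 2, 3, 0, 0]))
```

The integer loop replaces a floating-point log5 to avoid rounding surprises near exact powers of 5.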
It's called polidatacompressor.js, but the license will cost you; you have to ask the author about prices.
https://github.com/polidatacompressor/polidatacompressor
Ncomp(65535) will output 255, 255, and when you store this in a database as bytes you get 2 characters.
Another way is to use hexadecimal (base 16). In JavaScript, (1231).toString(16) gives you '4cf'; in about 60% of cases it shortens the string by one character.
Or convert base 10 to base 62: https://github.com/base62/base62.js/
4131 --> 14D
413131 --> 1Jtp

Why floating-points number's significant numbers is 7 or 6

I see this in Wikipedia: log10(2^24) = 7.22.
I have no idea why we should calculate 2^24 and why we should take log10......I really really need your help.
why floating-points number's significant numbers is 7 or 6 (?)
Consider some thoughts employing the Pigeonhole principle:
1. A binary32 float can encode about 2^32 different numbers exactly. The numbers one can write in text like 42.0, 1.0, 3.1415623... are infinite, even if we restrict ourselves to a range like -10^38 ... +10^38. Any time code has a textual value like 0.1f, it is encoded to a nearby float, which may not be exactly the same value as the text. The question is: how many digits can we code and still maintain distinct floats?
2. Within each power-of-2 range, 2^23 (8,388,608) values are normally linearly encoded.
3. Example: In the range [1.0 ... 2.0), 2^23 (8,388,608) values are linearly encoded.
4. In the range [2^33 or 8,589,934,592 ... 2^34 or 17,179,869,184), again, 2^23 (8,388,608) values are linearly encoded: 1024.0 apart from each other. In the sub-range [9,000,000,000 ... 10,000,000,000), there are about 976,562 different values.
Put this together ...
5. As text, in the range [1.000_000 ... 2.000_000), using 1 lead digit and 6 trailing ones, there are 1,000,000 different values. Per #3, in the same range 8,388,608 different floats exist, allowing each textual value to map to a different float. In this range we can use 7 digits.
6. As text, in the range [9,000,000 × 10^3 ... 10,000,000 × 10^3), using 1 lead digit and 6 trailing ones, there are 1,000,000 different values. Per #4, in the same range there are fewer than 1,000,000 different float values. Thus some decimal textual values will convert to the same float. In this range we can use 6, not 7, digits for distinctive conversions.
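The pigeonhole argument can be observed directly. The sketch below rounds Python floats (binary64) to binary32 through the `struct` module and looks for two neighbouring 7-significant-digit values that land on the same float; `to_float32` and `first_collision` are helper names chosen for this example.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def first_collision(start: int, step: int, count: int):
    """First pair of adjacent values start + i*step that map to one float32."""
    prev = start
    prev_f = to_float32(float(prev))
    for i in range(1, count):
        cur = start + i * step
        cur_f = to_float32(float(cur))
        if cur_f == prev_f:
            return prev, cur
        prev, prev_f = cur, cur_f
    return None


# 7 significant digits near 9 * 10^9: neighbours are 1000 apart, floats 1024 apart
print(first_collision(9_000_000_000, 1000, 1000))

# 7 significant digits in [1.0, 2.0): every sampled value keeps its own float
sample = {to_float32(i / 1_000_000) for i in range(1_000_000, 1_001_000)}
print(len(sample))
```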
The worst case for typical float is 6 significant digits. To find the limit for your float:
#include <float.h>
#include <stdio.h>
int main(void) { printf("FLT_DIG = %d\n", FLT_DIG); return 0; } // this commonly prints 6
... no idea why we should calculate 2^24 and why we should take log10
2^24 is a generalization: with common float and its 24 bits of binary precision, that corresponds to a fanciful decimal system with 7.22... digits. We take log10 to compare the binary float to decimal text.
2^24 == 10^7.22...
Yet we should not take 2^24. Let us look into how FLT_DIG is defined in C11dr §5.2.4.2.2 11:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
p log10(b) if b is a power of 10
⎣(p − 1) log10(b)⎦ otherwise
Notice "log10(2^24)" is the same as "24 log10(2)".
As a float, the values are distributed linearly between powers of 2 as shown in #2,3,4.
As text, values are distributed linearly between powers of 10, like the 7-significant-digit values [1.000000 ... 9.999999] × 10^some_exponent.
The transitions of these 2 groups happen at different values: 1, 2, 4, 8, 16, 32, ... versus 1, 10, 100, ... In determining the worst case, we subtract 1 from the 24 bits to account for the misalignment.
⎣(p − 1) log10(b)⎦ --> floor((24 − 1) log10(2)) --> floor(6.923...) --> 6.
Had our float used base 10, 100, or 1000, rather than the very common 2, the transitions of these 2 groups would happen at the same values and we would not subtract one.
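The quoted rule is easy to evaluate for other formats. This is a small sketch; `decimal_digits` is a made-up name, and the values 24 and 53 are the usual binary32 and binary64 precisions.

```python
import math


def decimal_digits(p: int, b: int = 2) -> int:
    """q = p*log10(b) if b is a power of 10, else floor((p - 1)*log10(b))."""
    # Detect b = 10^e with integer arithmetic to avoid rounding issues.
    e, rest = 0, b
    while rest > 1 and rest % 10 == 0:
        rest //= 10
        e += 1
    if rest == 1 and e > 0:
        return p * e
    return math.floor((p - 1) * math.log10(b))


print(decimal_digits(24))   # binary32
print(decimal_digits(53))   # binary64
```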
An IEEE 754 single-precision float has a 24-bit mantissa. This means it has 24 binary bits' worth of precision.
But we might be interested in knowing how many decimal digits worth of precision it has.
One way of computing this is to consider how many 24-bit binary numbers there are. The answer, of course, is 2^24. So these binary numbers go from 0 to 16777215.
How many decimal digits is that? Well, log10 gives you the number of decimal digits. log10(2^24) is 7.2, or a little more than 7 decimal digits.
And look at that: 16777215 has 8 digits, but the leading digit is just 1, so in fact it's only a little more than 7 digits.
(Of course this doesn't mean we can represent only numbers from 0 to 16777215! It means we can represent numbers from 0 to 16777215 exactly. But we've also got the exponent to play with. We can represent numbers from 0 to 1677721.5 more or less exactly to one place past the decimal, numbers from 0 to 167772.15 more or less exactly to two decimal points, etc. And we can represent numbers from 0 to 167772150, or 0 to 1677721500, but progressively less exactly -- always with ~7 digits' worth of precision, meaning that we start losing precision in the low-order digits to the left of the decimal point.)
The other way of doing this is to note that log10(2) is about 0.3. This means that 1 bit corresponds to about 0.3 decimal digits. So 24 bits corresponds to 24 × 0.3 = 7.2.
(Actually, IEEE 754 single-precision floating point explicitly stores only 23 bits, not 24. But there's an implicit leading 1 bit in there, so we do get the effect of 24 bits.)
Let's start a little smaller. With 10 bits (or 10 base-2 digits), you can represent the numbers 0 up to 1023. So you can represent up to 4 digits for some values, but 3 digits for most others (the ones below 1000).
To find out how many base-10 (decimal) digits can be represented by a bunch of base-2 digits (bits), you can use the log10() of the maximum representable value, i.e. log10(2^10) = log10(2) * 10 = 3.01....
The above means you can represent all 3 digit — or smaller — values and a few 4 digits ones. Well, that is easily verified: 0-999 have at most 3 digits, and 1000-1023 have 4.
Now take 24 bits. In 24 bits you can store log10(2^24) = 24 * log10(2) base-10 digits. But because the top bit is always the same, you can in fact only store log10(2^23) = log10(8388608) = 6.92. This means you can represent most 7-digit numbers, but not all. Some of the numbers you can represent faithfully can only have 6 digits.
The truth is a bit more complicated though, because exponents play a role too, and some of the many possible larger values can be represented too, so 6.92 may not be the exact value. But it gets close, and can nicely serve as a rule of thumb, and that is why they say that single precision can represent 6 to 7 digits.

Why do we always divide RGB values by 255? [closed]

Why do we always divide our RGB values by 255? I know that the range is [0, 1]. But why divide only by 255? Can anyone please explain the concepts of RGB values to me?
RGB (Red, Green, Blue) are 8 bit each.
The range for each individual colour is 0-255 (as 2^8 = 256 possibilities).
The combination range is 256*256*256.
By dividing by 255, the 0-255 range can be described with a 0.0-1.0 range where 0.0 means 0 (0x00) and 1.0 means 255 (0xFF).
This is a bit of a generic question since it can be specific to the platform and even to the method. It really comes down to math and getting a value between 0-1. Since 255 is the maximum value, dividing by 255 expresses a 0-1 representation.
Each channel (Red, Green, and Blue are each channels) is 8 bits, so each is limited to 256 values; the maximum is 255 since 0 is included. As the reference shows, systems typically use values between 0 and 1 when using floating-point values.
http://en.wikipedia.org/wiki/RGB_color_model
See Numeric Representations.
These ranges may be quantified in several different ways: From 0 to 1,
with any fractional value in between. This representation is used in
theoretical analyses, and in systems that use floating point
representations. Each color component value can also be written as a
percentage, from 0% to 100%. In computers, the component values are
often stored as integer numbers in the range 0 to 255, the range that
a single 8-bit byte can offer. These are often represented as either
decimal or hexadecimal numbers. High-end digital image equipment are
often able to deal with larger integer ranges for each primary color,
such as 0..1023 (10 bits), 0..65535 (16 bits) or even larger, by
extending the 24-bits (three 8-bit values) to 32-bit, 48-bit, or
64-bit units (more or less independent from the particular computer's
word size).
RGB values are usually stored as integers to save memory. But doing math on colors is usually done in float because it's easier, more powerful, and more precise. The act of converting floats to integers is called "Quantization", and it throws away precision.
Typically, RGB values are encoded as 8-bit integers, which range from 0 to 255. It's an industry standard to think of 0.0f as black and 1.0f as white (max brightness). To convert [0, 255] to [0.0f, 1.0f] all you have to do is divide by 255.0f.
If you care, this is the formula to convert back to integer: (int)floor(x * 255.0f + 0.5f). But first clamp x to [0.0f, 1.0f] if necessary.
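Both directions of the conversion fit in a few lines. The function names `to_unit` and `to_byte` are just labels for this sketch.

```python
import math


def to_unit(channel: int) -> float:
    """Map an 8-bit channel value 0..255 to 0.0..1.0."""
    return channel / 255.0


def to_byte(x: float) -> int:
    """Quantize 0.0..1.0 back to 0..255, clamping first and rounding to nearest."""
    x = min(1.0, max(0.0, x))
    return int(math.floor(x * 255.0 + 0.5))


print(to_unit(255), to_unit(0), to_byte(0.5))
```

Dividing by 255 rather than 256 is what makes the maximum byte value map to exactly 1.0.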
The RGB value goes up from 0 to 255 because it takes up exactly one byte of data. One byte is equal to 8 bits, and each bit represents either a 0 or a 1. This makes 0 in 8 bit binary: 00000000 and 255 11111111. The last bit says if there is a 1 in the value. The second last says if there is a 2 in the value. The third last says if there is a 4 in the value, and so on doubling every time. If you add up all of the small values that are present, you get the total value. For example,
=10110101
=1*128 + 0*64 + 1*32 + 1*16 + 0*8 + 1*4 + 0*2 + 1*1
=128 + 32 + 16 + 4 + 1
=181
This means that 10110101 in binary equals 181 in decimal form.
Given that each octet nowadays consists of 8 bits (binary digits),
suppose we have an octet filled like this:
1 0 1 0 0 1 0 1
For each bit you get 2 possibilities: 0 or 1
2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 = 2^8 = 256
Total: 256
And for hexadecimal colors:
given that you have 3 pairs of characters, hash sign excluded (e.g. #00ff00)
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f = 16 possibilities per character
16 x 16 = 256 per pair
R G B = color
256 x 256 x 256 = 16,777,216 colors
It makes the vector operations simpler... Imagine you have an image and you want to change its color to red. With vectors you can just take every pixel and multiply it by (1.0, 0.0, 0.0)
P * (1.0, 0.0, 0.0)
Otherwise it just adds unnecessary steps (in this case dividing it by 255)
P * (255, 0, 0) / 255
And imagine using more complex filters, the unnecessary steps would stack up...

Compact data structure for sorted array

I have a table with sorted numbers like:
1 320102
2 5200100
3 92010023
4 112010202
5 332020201
6 332020411
:
5000000000 3833240522044511
5000000001 3833240522089999
5000000002 4000000000213312
Given the record number I need the value in O(log n) time. The record number is 64-bit long and there are no missing record numbers. The values are 64-bit long, they are sorted and value(n) < value(n+1).
The obvious solution is simply doing an array and use the records number as index. This will cost 64-bit per value.
But I would like a more space efficient way of doing that. Since we know the values are always increasing that should be doable, but I do not remember a data structure that lets me do that.
A solution would be to use deflate on the array, but that will not give me O(log n) for accessing an element - thus unacceptable.
Do you know of a data structure that will give me:
O(log n) for access
space requirement < 64-bit/value
= Edit =
Since we know all numbers in advance we could find the difference between each number. By taking the 99th percentile of these differences we will get a relatively modest number. Taking the log2 will give us the number of bits needed to represent modest number - let us call that modest-bits.
Then create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
Then a delta table for all records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
modest-bits difference to record k*1024 will always be 0, so we can use that for signaling. If it is non-zero, then the following 64-bit will be a pointer to a simple array for the next 1024 records as 64-bit values.
As the modest value is chosen as the 99th percentile number, that will at most happen 1% of the time, thus wasting at most 1% * n * modest-bits + 1% * n * 64-bit * 1024.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * modest-bits + 1% * n * 64-bit * 1024)
lookup: O(1 + 1024)
(99% and 1024 may have to be adjusted)
= Edit2 =
Based on the idea above, but wasting less space. Create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
And for all value that cannot be represented by modest-bits create big-value table as a tree:
64-bit position, 64-bit value
64-bit position, 64-bit value
64-bit position, 64-bit value
Then a delta table for all records, that is reset for every 1024 records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
but also reset for every value that is in the big-value table.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * 2 * 64-bit).
Lookup requires searching big-value table, then looking up the 1024'th value and finally summing up the modest-bits values.
lookup: O(log(big-value table) + 1 + 1024) = O(log n)
Can you improve this? Or do better in a different way?
OP proposes splitting numbers into blocks (only once). But this process may be continued. Split every block once more. And again... Finally we might get a binary trie.
Root node contains value of the number with least index. Its right descendant stores difference between the middle number in the table and the number with least index: d = A[N/2] - A[0] - N/2. This is continued for other right descendants (red nodes on diagram). Leaf nodes contain deltas from preceding numbers: d = A[i+1] - A[i] - 1.
So most of the values, stored in trie, are delta values. Each of them occupies less than 64 bits. And for compactness they may be stored as variable-bit-length numbers in a bit stream. To get length of each number and to navigate in this structure in O(log N) time, bit stream should also contain lengths of (some) numbers and (some) subtrees:
Each node contains length (in bits) of its left sub-tree (if it has one).
Each right descendant (red nodes on diagram), except leaf nodes, contains length (in bits) of its value. Leaf node's length may be calculated from other lengths on the path from root to this node.
Each right descendant (red nodes on diagram) contains difference of correspondent value and the value of nearest "red" node up the path.
All nodes are packed in bit stream, starting from root node, in-order: left descendant always follows its ancestor; right descendant follows sub-tree, rooted by left descendant.
To access element given its index, use index's binary representation to follow path in the trie. While traversing this path, add together all values of "red" nodes. Stop when no more non-zero bits are left in the index.
There are several options to store N/2 value lengths:
Allocate as many bits for each length as needed to represent all values from the largest length to somewhere below mean length (excluding some very short outliers).
Also exclude some long outliers (keep them in a separate map).
Since lengths may be not evenly distributed, it's reasonable to use Huffman encoding for value lengths.
Either fixed length or Huffman encodings should be different for each trie depth.
N/4 subtree lengths are, in fact, value lengths, because N/4 smallest subtrees contain a single value.
Other N/4 subtree lengths may be stored in words of fixed (predefined) length, so that for large subtrees we know only approximate (rounded up) lengths.
For 2^30 full-range 64-bit numbers we have to pack approximately 34-bit values, for 3/4 nodes, approx. 4-bit value lengths, and for every fourth node, 10-bit subtree lengths. Which saves 34% space.
Example values:
0 320102
1 5200100
2 92010023
3 112010202
4 332020201
5 332020411
6 3833240522044511
7 3833240522089999
8 4000000000213312
Trie for these values:
root              d=320102           vl=19    tl=84+8+105+4+5=206
  +-l                                         tl=75+4+5=84
  | +-l                                       tl=23
  | | +-l
  | | | +-r       d=4879997          (vl=23)
  | | +-r         d=91689919         vl=27
  | |   +-r       d=20000178         (vl=25)
  | +-r           d=331700095        vl=29    tl=8
  |   +-l
  |   | +-r       d=209              (vl=8)
  |   +-r         d=3833240190024308 vl=52
  |     +-r       d=45487            (vl=16)
  +-r             d=3999999999893202 vl=52
Value length encoding:
         bits  start  end
Root       0     19    19
depth 1    0     52    52
depth 2    0     29    29
depth 3    5     27    52
depth 4    4      8    23
Sub-tree lengths need 8 bits each.
Here is encoded stream (binary values still shown in decimal for readability):
bits  value             comment
19    320102            root value
 8    206               left subtree length of the root
 8    84                left subtree length
 4    15                smallest left subtree length (with base value 8)
23    4879997           value for index 1
 5    0                 value length for index 2 (with base value 27)
27    91689919          value for index 2
25    20000178          value for index 3
29    331700095         value for index 4
 4    0                 smallest left subtree length (with base value 8)
 8    209               value for index 5
 5    25                value length for index 6 (with base value 27)
52    3833240190024308  value for index 6
16    45487             value for index 7
52    3999999999893202  value for index 8
Altogether 285 bits or 5 64-bit words. We also need to store bits/start values from value length encoding table (350 bits). To store 635 bits we need 10 64-bit words, which means such a small number table cannot be compressed. For larger number tables, size of value length encoding table is negligible.
To search a value for index 7, read root value (320102), skip 206 bits, add value for index 4 (331700095), skip 8 bits, add value for index 6 (3833240190024308), add value for index 7 (45487), and add index (7). The result is 3 833 240 522 089 999, as expected.
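The arithmetic of this lookup can be checked for every index of the example. In the sketch below the per-index lists of d values are copied by hand from the example trie; a real implementation would read them from the bit stream while following the path, so this only verifies the sums, not the navigation.

```python
ROOT = 320102

# d values met on the path from the root to each index (root excluded)
PATH_DELTAS = {
    0: [],
    1: [4879997],
    2: [91689919],
    3: [91689919, 20000178],
    4: [331700095],
    5: [331700095, 209],
    6: [331700095, 3833240190024308],
    7: [331700095, 3833240190024308, 45487],
    8: [3999999999893202],
}


def lookup(index: int) -> int:
    """Root value + all d values on the path + the index itself."""
    return ROOT + sum(PATH_DELTAS[index]) + index


print(lookup(7))
```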
I would do it in blocks, as you outline in your question. Pick a block size k, where you can accept having to decode on average k/2 values before getting to the one you're after. For the n total values, you will have n/k blocks. A table with n/k entries would point into the data stream to find the starting point of each block. Finding where to go in that table would be O(log(n/k)) for a binary search, or if the table is small enough and if it matters, you could make it about O(1) with an auxiliary hash table.
Each block would start with a starting 64-bit value. All values after that would be stored as deltas from the preceding value. My suggestion is to store those deltas as a Huffman code that says how many bits are in the next value, followed by that many bits. The Huffman code would be optimized for each block, and a description of that code would be stored at the beginning of the block.
You could simplify that by just preceding each value with six bits having the number of bits following, in the range of 1..64, effectively a flat Huffman code. Depending on the histogram of the bit lengths, an optimized Huffman code could knock off a good number of bits compared to the flat code.
Once you have this set up, you can experiment with k and see how small you can make it and still have limited impact on the compression.
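A rough sketch of this block scheme follows. To keep it short it swaps the Huffman-coded bit lengths for a plain byte-oriented varint (7 payload bits per byte), which compresses less well but shows the same structure; the class name `BlockDeltaArray` and the default block size are assumptions. Since every block holds exactly k records, the block table is indexed directly instead of searched.

```python
class BlockDeltaArray:
    """Sorted 64-bit values: one absolute value per block, varint deltas after it."""

    def __init__(self, values, block_size=4):
        self.k = block_size
        self.n = len(values)
        self.data = bytearray()
        self.offsets = []                      # byte offset where each block starts
        for i, v in enumerate(values):
            if i % self.k == 0:
                self.offsets.append(len(self.data))
                self.data += v.to_bytes(8, "little")
            else:
                self._put_varint(v - values[i - 1])

    def _put_varint(self, x):
        while x >= 0x80:
            self.data.append((x & 0x7F) | 0x80)   # high bit set: more bytes follow
            x >>= 7
        self.data.append(x)

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        pos = self.offsets[i // self.k]
        value = int.from_bytes(self.data[pos:pos + 8], "little")
        pos += 8
        for _ in range(i % self.k):               # decode at most k - 1 deltas
            delta, shift = 0, 0
            while True:
                byte = self.data[pos]
                pos += 1
                delta |= (byte & 0x7F) << shift
                shift += 7
                if byte < 0x80:
                    break
            value += delta
        return value


table = [320102, 5200100, 92010023, 112010202, 332020201, 332020411,
         3833240522044511, 3833240522089999, 4000000000213312]
arr = BlockDeltaArray(table, block_size=4)
print(arr[7], len(arr.data), "bytes instead of", 8 * len(table))
```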
I do not know of a data structure that does that.
The obvious solution to gain space and not lose too much speed would be to create your own structure with different array sizes based on the different integer sizes you store.
Pseudo-code
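class memoryAwareArray {
    // Values must be added in ascending order, so all 16-bit values come
    // first, then all 32-bit values, then everything else.
    array16 = UInt16[] // 2 bytes per value
    array32 = UInt32[] // 4 bytes per value
    array64 = UInt64[] // 8 bytes per value
    max16Index = 0;
    max32Index = 0;
    max64Index = 0;
    // index is implied by insertion order; it is kept only for the signature
    addObjectAtIndex(index, value) {
        if (value <= 65535) {          // fits in an unsigned 16-bit slot
            array16[max16Index] = value;
            max16Index++;
            return;
        }
        if (value <= 4294967295) {     // fits in an unsigned 32-bit slot
            array32[max32Index] = value;
            max32Index++;
            return;
        }
        array64[max64Index] = value;
        max64Index++;
    }
    getObject(index) {
        if (index < max16Index) return(array16[index]);
        if (index < max16Index + max32Index) return(array32[index - max16Index]);
        return(array64[index - max16Index - max32Index]);
    }
}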
Something along those lines shouldn't affect the speed too much, and you'd save around 7 gigabytes if you filled up the entire structure. You won't save as much, of course, since you have gaps between your values.
