SQL Server: Hex values padded with zeros and bytes out of order - sql-server

I'm working on a database that has a VARBINARY(255) column that doesn't make sense to me. Depending on the length of the value, the value is either numbers or words.
Whatever number is stored appears as a 4-byte hex string of the form 0x00000000, but while the string reads left to right, the bytes are ordered right to left. So a number such as 255 is stored as 0xFF000000, and a number such as 745 as 0xE9020000. This is the part I do not understand: why is it stored that way instead of 0x02E9, 0x2E9, or 0x000002E9?
When it comes to words, each character is stored as a 4-byte hex string just like the numbers. A space is stored as 0x20000000, and a word like Sensor is stored as 0x53000000650000006E000000730000006F00000072000000 instead of just 0x53656E736F72.
Can anyone explain to me why the data is stored in this way? Is everything represented as 4-byte strings because the numbers stored can be the full 4-bytes while text is padded with zeros for consistency? Why are the zeros padded to the right of the value? Why are the values stored with the 4th byte first and 1st byte last?
If none of this makes sense from an SQL standpoint, I suppose it is possible that the data is being provided this way by the client application, whose source I do not have access to. Could that be the case?
Lastly, I would like to create a report that includes this column, but converted to the correct numbers or words. Is there a simpler and more performant method than using substrings, trims, and recursion?

With the help of Smor in the comments above, I can now answer my own questions.
The client application provides the 4-byte strings and the database just takes them as they fit within the column's VARBINARY(255) data type and length. Since the application is providing the values in a little-endian format, they are stored in that way within the database with the least significant byte first and the most significant byte last. Being that most values are smaller than the static 4-byte length, the values are padded with zeros to the right to fit the 4-byte requirement.
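The little-endian point can be sketched outside SQL. A minimal Python illustration, using the values from the question, of how the same four bytes decode differently depending on byte order:

```python
import struct

# 745 stored little-endian in 4 bytes: least significant byte first
raw = bytes.fromhex("E9020000")   # the 0xE9020000 value from the question

# '<I' = little-endian unsigned 32-bit integer
(value,) = struct.unpack("<I", raw)
print(value)  # 745

# The same bytes read big-endian give a very different number
(wrong,) = struct.unpack(">I", raw)
print(hex(wrong))
```

Reading the bytes in the stored order yields the intended 745; only the interpretation of byte order differs, not the data.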
Now as to my question of the report, this is what I came up with:
CASE
    WHEN LEN(ByteValue) <= 4
        THEN CAST(CAST(CAST(REVERSE(ByteValue) AS VARBINARY(4)) AS INT) AS VARCHAR(12))
    ELSE CAST(CONVERT(VARBINARY(255), REPLACE(CONVERT(VARCHAR(255), ByteValue, 1), '000000', ''), 1) AS VARCHAR(100))
END AS PlainValue
In my particular case, only numbers are stored as just 4-byte or less values while words are stored as much longer values. This allows me to break the smaller values into numbers while longer values are broken down into words.
Using CASE WHEN, I can specify that only data of 4 bytes or less needs the REVERSE() function, as that is the easiest way to convert the little-endian format to the big-endian format SQL Server expects when converting from hex to integers. Because REVERSE() returns an NVARCHAR, I then have to convert that back to VARBINARY, then to INT, then to VARCHAR to match the data type returned by the ELSE branch.
Any value longer than 4 bytes, used specifically for words, falls under the ELSE branch, which strips the extra zeros from the hex value so I get just the first byte of each 4-byte character (the only part that matters in my situation). By converting the hex string to VARCHAR, I can easily remove the six repeating zeros with REPLACE(). With the zeros gone, the string converts back to VARBINARY and then to VARCHAR with ease.
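The same decoding logic can be sketched in Python (a conceptual illustration, assuming, as above, that values of four bytes or fewer are little-endian integers and longer values are ASCII text with one character per 4-byte group):

```python
def decode(raw: bytes) -> str:
    # Values of 4 bytes or fewer are little-endian integers
    if len(raw) <= 4:
        return str(int.from_bytes(raw, "little"))
    # Longer values are text: one character per 4-byte group,
    # and only the first byte of each group carries data
    return bytes(raw[i] for i in range(0, len(raw), 4)).decode("ascii")

print(decode(bytes.fromhex("E9020000")))  # "745"
print(decode(bytes.fromhex("53000000650000006E000000730000006F00000072000000")))  # "Sensor"
```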

Related

Question about the description of "bit string type" in the openGauss official website document

I noticed that the openGauss official website documentation describes the bit string type as follows: "A bit string is a string of 1s and 0s". I also found that this type is not listed under either "character type" or "binary type"; it is an independent type. Since the description mentions both "0, 1" and "string", I am somewhat confused about this type, and I have the following three questions:
Does this type store binary data or character data?
If it stores binary data, then according to the answer in the previous forum thread (that the bit string type has no storage upper limit), is the only difference between the bit string type and the binary type that the bit string has no upper limit on storage space while the binary type does?
Can it be used to store larger (e.g. > 2 GB) raw binary data?
Bit string type: it is a string of 0s and 1s, but the database stores it internally at the bit level to save space. Without going too deeply into the underlying logic, it is a special string that can consist only of the characters 0 and 1, which is convenient for storing masks and the like.
Binary type: specialized for storing binary data. Taking bytea as an example, any ASCII character entered in a SQL statement is stored as its binary ASCII value, and a query displays the hexadecimal code of that ASCII value. For example, insert 'a' and the result of a select will be \x61. Other binary types behave similarly.
Taking the input character '0' as an example, the bit string type stores the bit 0, while bytea stores the ASCII code of the character '0'. On output, the bit string type prints the character '0', while bytea prints \x30.
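That contrast can be sketched in Python (a conceptual illustration, not openGauss code):

```python
# A "bit string" stores the bits themselves: each '0'/'1' character becomes one bit.
bits = "0101"
packed = int(bits, 2)          # four characters collapse into the integer 0b0101
print(packed)                  # 5

# A binary (bytea-style) column stores the ASCII bytes of the input instead:
raw = "0101".encode("ascii")
print(raw.hex())               # "30313031" -- '0' is 0x30, '1' is 0x31
```

Four bit-string characters fit in half a byte, while the same input stored as binary data occupies four full bytes.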

SQL Server - best way to store large bitmask?

I need to store a large bitmask (potentially up to 3000 bits long) in a SQL Server table.
I have the data as a string of 1s and 0s.
The requirements:
The field that I store it in needs to be indexable
I need to be able to compare the stored value with other stored values to see if they are equal
I do NOT need to do any bitwise operations
I need any conversion from the bitmask to the stored value and back again to be fast
The length of each string of 1s and 0s is unknown, but there is a max length (of 3000)
I have millions of these to store
What is the best way to store this data? I was looking at using a varbinary field, but there don't seem to be particularly neat or quick ways of converting a string of bits to a hex value.
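For the conversion itself, here is a hedged Python sketch (the 375-byte size and helper names are illustrative assumptions; it left-pads every mask to the full 3000 bits so equal masks always compare byte-for-byte equal, which assumes masks of different lengths that agree after padding count as the same mask — otherwise store the original length in a separate column):

```python
MAX_BITS = 3000
NBYTES = (MAX_BITS + 7) // 8   # 375 bytes cover 3000 bits

def bits_to_bytes(bits: str) -> bytes:
    # Left-pad to full length so equal masks always compare equal
    padded = bits.zfill(MAX_BITS)
    return int(padded, 2).to_bytes(NBYTES, "big")

def bytes_to_bits(raw: bytes, length: int) -> str:
    value = int.from_bytes(raw, "big")
    return format(value, "b").zfill(MAX_BITS)[-length:]

mask = "10110"
stored = bits_to_bytes(mask)
print(len(stored))                       # 375
print(bytes_to_bits(stored, len(mask)))  # "10110"
```

A fixed binary(375) column produced this way is indexable and supports the equality comparisons the requirements call for.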

Storing hexadecimal values

I'd like to store this value efficiently in MSSQL 2016:
6d017ed2a48846f0ac025dd8603902c7
i.e. a fixed-length hexadecimal value, with each digit ranging from 0 to f, right?
Char(32) seems too expensive.
Any kind of help would be appreciated. Thank you!
In almost all cases you shouldn't store this as a string at all. SQL Server has binary and varbinary types.
This string represents a 16-byte binary value. If the expected size is fixed, it can be stored as a binary(16). If the size changes, it can be stored as a varbinary(N) where N is the maximum expected size.
Don't use varbinary(max), that's meant to store BLOBs and has special storage and indexing characteristics.
Storing the string itself would make sense in a few cases, e.g. if it's a hash string used in an API, or if it's meant to be shown to humans. In those cases the data will always arrive as a string and will always have to be converted back to a string to be used, so the constant conversions would probably cost more than the storage savings.
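A quick Python illustration of the size difference (the value is the one from the question):

```python
s = "6d017ed2a48846f0ac025dd8603902c7"

raw = bytes.fromhex(s)    # what a binary(16) column would hold
print(len(s))             # 32 characters as char(32)
print(len(raw))           # 16 bytes as binary(16)
print(raw.hex())          # round-trips back to the original string
```

Half the storage, and the hex representation is recoverable losslessly whenever it needs to be displayed.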

packing two single precision values into one

I'm working in LabVIEW with very constrained RAM. I have two arrays that require single precision since I need decimal points. However, single precision takes too much space for what I have; the decimal values I work with are within 0.00 to 1000.00.
Is there an intuitive way to pack these two arrays together so I can save some space? Or is there a different approach I can take?
If you need to represent 0.00 to 1000.00 in steps of 0.01, you've got 100,001 values. That cannot be represented in fewer than 17 (whole) bits. That means that to fit two numbers in, you'll need 34 bits, which is obviously more than you can fit in a 32-bit space. I suggest you limit your space of values. You could dedicate 11 bits to the integer value (0 to 2047) and 5 bits to the decimal value (0 to 0.96875 in chunks of 1/32, or 0.03125). Then you'll be able to fit two decimal values into one 32-bit space.
Just remember the extra bit manipulation you have to do for this is likely to have a small performance impact on your application.
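The suggested 11-bit/5-bit fixed-point packing can be sketched like this (Python for illustration; the function names are assumptions, and LabVIEW would use its own bit-manipulation primitives):

```python
FRAC_BITS = 5
SCALE = 1 << FRAC_BITS   # fractional steps of 1/32

def to_fixed(x: float) -> int:
    # 11-bit integer part + 5-bit fraction -> 16 bits total
    return round(x * SCALE) & 0xFFFF

def from_fixed(f: int) -> float:
    return f / SCALE

def pack_pair(a: float, b: float) -> int:
    # Two 16-bit fixed-point values in one 32-bit word
    return (to_fixed(a) << 16) | to_fixed(b)

def unpack_pair(word: int) -> tuple:
    return from_fixed(word >> 16), from_fixed(word & 0xFFFF)

packed = pack_pair(745.25, 3.5)
print(unpack_pair(packed))  # (745.25, 3.5)
```

Values round-trip exactly only when they fall on 1/32 boundaries; everything else is rounded to the nearest step, which is the precision trade-off the answer describes.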
First of all it would be good general advice to double check you've correctly understood how LabVIEW stores data in memory and whether any of your VI's are using more memory than they need to.
If you still need to squeeze this data into the minimum space, you could do something like:
Instead of a 1D array of n values, use a 2D array of ceiling(n/16) x 17 U16's. Each U16 is going to hold one bit from each of 16 of your data values.
To read value m from the array, get the 17 U16's from row m/16 of the array and get bit (m MOD 16) from each U16, then combine them to create the value you need.
To write to the array, get the relevant 17 U16's, replace the relevant bit of each with the bits representing the new value, and replace the changed U16's in the array.
I guess this won't be fast but maybe you can optimise it for the particular operations you need to do on this data.
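A Python sketch of this bit-slicing scheme (names and layout are illustrative assumptions): each row of 17 U16s holds 16 seventeen-bit values, with U16 number k of a row holding bit k of each of the 16 values.

```python
BITS = 17    # each stored value is 17 bits wide
GROUP = 16   # values per row; one bit position per value in each U16

def write_value(rows, m, value):
    row, bit = divmod(m, GROUP)
    for k in range(BITS):
        if (value >> k) & 1:
            rows[row][k] |= 1 << bit                 # set bit `bit` of slice k
        else:
            rows[row][k] &= ~(1 << bit) & 0xFFFF     # clear it, staying in U16 range

def read_value(rows, m):
    row, bit = divmod(m, GROUP)
    # Reassemble the 17 bits scattered across the row's 17 U16s
    return sum(((rows[row][k] >> bit) & 1) << k for k in range(BITS))

rows = [[0] * BITS for _ in range(2)]   # room for 32 values
write_value(rows, 5, 100000)
print(read_value(rows, 5))  # 100000
```

As the answer says, the per-access bit shuffling is not fast, but 16 values fit in 17 U16s (272 bits) instead of 16 SGLs (512 bits).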
Alternatively, could you perhaps use some sort of data compression? I imagine that would work best if you can organise the data into 'pages' containing some set number of values. For example, you could take a 1D array of SGL, flatten it to a string, then apply the compression to the string, and store the compressed string in a string array. I believe OpenG includes zip tools, for example.

SQL Server data types: Store 8-digit unsigned number as INT or as CHAR(8)?

I think the title says everything.
Is it better (faster and more space-saving, both in memory and on disk) to store 8-digit unsigned numbers as an INT or as a CHAR(8)?
Would I get into trouble when the number grows to 9 digits in the future if I use a fixed char length?
Background info: I want to store TACs.
Thanks
Given that TACs can have leading zeroes, that they're effectively an opaque identifier, and are never calculated with, use a char column.
Don't start optimizing for space before you're sure you've modelled your data types correctly.
Edit
But to avoid getting junk in there, make sure you apply a CHECK constraint also. E.g if it's meant to be 8 digits, add
CONSTRAINT CK_Only8Digits CHECK (not TAC like '%[^0-9]%' and LEN(RTRIM(TAC)) = 8)
If it is a number, store it as a number.
Integers are stored using 4 bytes, giving them the range:
-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)
So, suitable for your needs.
A char(8) will be stored as 8 bytes, so double the storage, and of course it suffers from the need to expand in the future (converting almost 10M records from 8 to 9 chars will take time and will probably require taking the database offline for the duration).
So, from storage, speed, memory and disk usage (all related to the number of bytes used for the datatype), readability, semantics and future proofing, int wins hands down on all.
Update
Now that you have clarified that you are not storing numbers, I would say that you will have to use char in order to preserve the leading zeroes.
As for the issue with future expansion: since char is a fixed-length type, changing from char(8) to char(9) would not lose information. Note that SQL Server pads char values with trailing spaces, so the additional character will appear on the right. You will still want to test, and once the field has been expanded you will need to ensure the original data has been preserved.
A better way may be to create a new char(9) field, migrate all the char(8) data to it (to keep things reliable and consistent), then remove the char(8) field and rename the new field to the original name. Of course this would invalidate the statistics on the table (but so would expanding the field directly).
An int will use less memory space and give faster indexing than a char.
If you need to take these numbers apart -- search for everything where digits 3-4 are "02" or some such -- char would be simpler and probably faster.
I gather you're not doing arithmetic on them. You'd not adding two TACs together or finding the average TAC for a set of records or anything like that. If you were, that would be a slam-dunk argument for using int.
If they have leading zeros, it's probably easier to use char so you don't have to pad the number with zeros to the correct length every time.
If none of the above applies, it doesn't matter much. I'd probably use char. I can't think of a compelling reason to go either way.
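The leading-zero point can be illustrated in Python (the TAC value here is made up):

```python
tac = "00440123"            # a hypothetical 8-digit TAC with leading zeros

as_int = int(tac)           # stored as an INT, the leading zeros are lost
print(as_int)               # 440123

restored = str(as_int).zfill(8)   # every read must re-pad on the way out
print(restored)             # "00440123"
```

The value is recoverable only because the length is fixed and known; with char storage no re-padding step is needed at all.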
Stick to INT for this one, DEFINITELY INT (or BIGINT).
Have a look at int, bigint, smallint, and tinyint (Transact-SQL)
int: integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647). Storage size is 4 bytes.
bigint: integer (whole number) data from -2^63 (-9,223,372,036,854,775,808) through 2^63 - 1 (9,223,372,036,854,775,807). Storage size is 8 bytes.
compared to
char and varchar
Fixed-length non-Unicode character data with a length of n bytes. n must be a value from 1 through 8,000. Storage size is n bytes.
Also, once you query against this, you will see degraded performance if you compare int values against your char column, as SQL Server will have to do a cast for you...
