I'd like to store this value efficiently in MSSQL 2016:
6d017ed2a48846f0ac025dd8603902c7
i.e., fixed-length, hexadecimal digits ranging from 0 to f, right?
Char(32) seems too expensive.
Any kind of help would be appreciated. Thank you!
In almost all cases you shouldn't store this as a string at all. SQL Server has binary and varbinary types.
This string represents a 16-byte binary value. If the expected size is fixed, it can be stored as a binary(16). If the size changes, it can be stored as a varbinary(N) where N is the maximum expected size.
Don't use varbinary(max), that's meant to store BLOBs and has special storage and indexing characteristics.
Storing the string itself would make sense in a few cases, e.g. if it's a hash string used in an API, or if it's meant to be shown to humans. In those scenarios the data will always arrive as a string and will always have to be converted back to a string to be used, so the constant conversions will probably cost more than the storage savings.
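To see the size difference concretely, here's a quick sketch in Python (the byte-level math is the same regardless of language):

```python
# A 32-character hex string represents exactly 16 raw bytes.
s = "6d017ed2a48846f0ac025dd8603902c7"
raw = bytes.fromhex(s)   # what a binary(16) column would store

assert len(s) == 32      # 32 bytes stored as char(32)
assert len(raw) == 16    # 16 bytes stored as binary(16)

# Converting back for display is lossless:
assert raw.hex() == s
```

So the binary form halves the storage per row while preserving the value exactly.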
Related
I'm working on a database that has a VARBINARY(255) column that doesn't make sense to me. Depending on the length of the value, the value is either numbers or words.
For whatever number is stored, it is a 4-byte hex string 0x00000000, but reads left to right while the bytes read right to left. So for a number such as 255, it is 0xFF000000 and for a number such as 745, it is 0xE9020000. This is the part that I do not understand, why is it stored that way instead of 0x02E9, 0x2E9 or 0x000002E9?
When it comes to words, each character is stored as a 4-byte hex string just like above. Something like a space is stored as 0x20000000, but a word like Sensor it is 0x53000000650000006E000000730000006F00000072000000 instead of just 0x53656E736F72.
Can anyone explain to me why the data is stored in this way? Is everything represented as 4-byte strings because the numbers stored can be the full 4-bytes while text is padded with zeros for consistency? Why are the zeros padded to the right of the value? Why are the values stored with the 4th byte first and 1st byte last?
If none of this makes sense from an SQL standpoint, I suppose it is possible that the data is being provided this way from the client application which I do not have access to the source on. Could that be the case?
Lastly, I would like to create a report that includes this column, but converted to the correct numbers or words. Is there a simpler and more performant method than using substrings, trims, and recursion?
With the help of Smor in the comments above, I can now answer my own questions.
The client application provides the 4-byte strings and the database just takes them as they fit within the column's VARBINARY(255) data type and length. Since the application is providing the values in a little-endian format, they are stored in that way within the database with the least significant byte first and the most significant byte last. Being that most values are smaller than the static 4-byte length, the values are padded with zeros to the right to fit the 4-byte requirement.
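The little-endian layout can be reproduced in a few lines of Python, for anyone who wants to verify the examples above:

```python
# 745 (0x2E9) stored little-endian in 4 bytes: least significant byte first
assert (745).to_bytes(4, "little").hex() == "e9020000"
assert (255).to_bytes(4, "little").hex() == "ff000000"

# Big-endian (the "expected" left-to-right reading order) for comparison
assert (745).to_bytes(4, "big").hex() == "000002e9"
```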
Now as to my question of the report, this is what I came up with:
CASE
    WHEN LEN(ByteValue) <= 4
        THEN CAST(CAST(CAST(REVERSE(ByteValue) AS VARBINARY(4)) AS INT) AS VARCHAR(12))
    ELSE CAST(CONVERT(VARBINARY(255), REPLACE(CONVERT(VARCHAR(255), ByteValue, 1), '000000', ''), 1) AS VARCHAR(100))
END AS PlainValue
In my particular case, only numbers are stored as just 4-byte or less values while words are stored as much longer values. This allows me to break the smaller values into numbers while longer values are broken down into words.
Using CASE WHEN, I can specify that only data 4 bytes or less needs the REVERSE() function, as it is the easiest way to convert the little-endian format to the big-endian format that SQL Server expects when converting from hex to integers. Because REVERSE() returns an NVARCHAR, I then have to convert that back to VARBINARY, then to INT, then to VARCHAR to match the datatype of the ELSE branch.
Any string longer than 4 bytes, used specifically for words, falls under the ELSE part, which strips the extra zeros from the hex value so I get just the first byte of each 4-byte character (the only part that matters in my situation). By converting the hex string to VARCHAR, I can easily remove the six repeating zeros with REPLACE(). With the zeros gone, converting the string back to VARBINARY and then to VARCHAR is straightforward.
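For what it's worth, the 4-bytes-per-character little-endian layout described above matches UTF-32LE, so the whole decode can be sketched in Python (a language-agnostic illustration, not part of the SQL solution):

```python
data = bytes.fromhex("53000000650000006E000000730000006F00000072000000")

# Each character occupies 4 little-endian bytes, i.e. UTF-32LE
assert data.decode("utf-32-le") == "Sensor"

# Equivalent to keeping only every 4th byte, which is what the
# REPLACE('000000', '') trick achieves on the hex representation
assert bytes(data[::4]).decode("ascii") == "Sensor"
```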
I have a question regarding the data types available in SQL for storing data in the database itself. Since I'm dealing with a database that is quite large and tends to expand beyond 150 GB of data, I need to pay close attention and save every bit of space on the server's hard drive so that the database doesn't take up all the precious space. So my question is as follows:
Which data type is the best to store 80-200 character long string in database?
I'm aware of, for example, varchar(200) and nvarchar(200), where nvarchar supports Unicode characters. Which of these would take up less space in the database? Or is there a third data type that I'm not aware of which I could use to store the data (given that I know for a fact the string will be just a combination of numbers and letters, without any special characters)?
Are there other techniques I could use to save space in the database so that it doesn't expand rapidly?
Can someone help me out with this?
P.S. Guys, I have a 4th question as well:
If for example I have nvarchar(max) data type which is in a table, and the entered record takes up only 100 characters, how much data is reserved for that kind of record?
Let's say that I have an ID of the following form: 191697193441 ... Would it make more sense to store this number as varchar(200) or bigint?
The size needed for nvarchar is 2 bytes per character, as it represents Unicode data; varchar needs 1 byte per character. The storage size is the actual number of characters entered + 2 bytes of overhead. This is also true for varchar(max).
From https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql:
varchar [ ( n | max ) ] Variable-length, non-Unicode string data. n defines the string length and can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size is the actual length of the data entered + 2 bytes.
So for your 4th question, nvarchar would need 100 * 2 + 2 = 202 bytes, varchar would need 100 * 1 + 2 = 102 bytes.
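As a sanity check, the storage formulas from the quote work out like this (a trivial Python sketch; it ignores row compression and other storage features):

```python
def varchar_bytes(chars):
    # 1 byte per character + 2 bytes of length overhead
    return chars * 1 + 2

def nvarchar_bytes(chars):
    # 2 bytes per character + 2 bytes of length overhead
    return chars * 2 + 2

assert varchar_bytes(100) == 102
assert nvarchar_bytes(100) == 202
```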
There's no performance or data-size difference between different declared lengths, as these are variable-length data types: they only use the space they need.
Think of the size parameter more as a useful constraint. For example, if you have a surname field, you can reasonably expect 50 characters to be a sensible maximum, and you have a better chance of a mistake (misuse of the field, incorrect data capture, etc.) throwing an error rather than adding nonsense to the database and needing future data cleansing.
So, my general rule of thumb is make them as large as the business requirements demand, but no larger. It's trivial to amend variable data sizes to a larger length in the future if requirements change.
I think the title says everything.
Is it better (faster, more space-saving in memory and on disk) to store 8-digit unsigned numbers as an int or as a char(8) type?
Would I get into trouble when the number changes to 9 digits in the future if I use a fixed char length?
Background info: I want to store TACs.
Thanks
Given that TACs can have leading zeroes, that they're effectively an opaque identifier, and are never calculated with, use a char column.
Don't start optimizing for space before you're sure you've modelled your data types correctly.
Edit
But to avoid getting junk in there, make sure you also apply a CHECK constraint. E.g. if it's meant to be 8 digits, add
CONSTRAINT CK_Only8Digits CHECK (not TAC like '%[^0-9]%' and LEN(RTRIM(TAC)) = 8)
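If you also validate application-side, the same rule can be sketched in Python (the 8-digit length comes from the example above):

```python
import re

def is_valid_tac(s: str) -> bool:
    # Same rule as the CHECK constraint: exactly 8 digits, nothing else
    return re.fullmatch(r"[0-9]{8}", s) is not None

assert is_valid_tac("35693803")
assert not is_valid_tac("3569380")    # too short
assert not is_valid_tac("35693a03")   # contains a non-digit
```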
If it is a number, store it as a number.
Integers are stored using 4 bytes, giving them the range:
-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)
So, suitable for your needs.
char(8) will be stored as 8 bytes, so double the storage, and of course it suffers from the need to expand in the future (converting almost 10M records from 8 to 9 chars will take time and will probably require taking the database offline for the duration).
So for storage, speed, memory and disk usage (all related to the number of bytes used by the datatype), readability, semantics, and future-proofing, int wins hands down on all counts.
Update
Now that you have clarified that you are not storing numbers, I would say that you will have to use char in order to preserve the leading zeroes.
As for future expansion: since char is a fixed-length field, changing from char(8) to char(9) would not lose information. However, I am not sure whether the additional character will be added on the right or the left (this may be implementation-defined). You will have to test, and once the field has been expanded you will need to verify that the original data has been preserved.
A better way may be to create a new char(9) field, migrate all the char(8) data to it (to keep things reliable and consistent), then remove the char(8) field and rename the new field to the original name. Of course, this would invalidate all statistics on the table (but so would expanding the field directly).
An int will use less memory space and give faster indexing than a char.
If you need to take these numbers apart -- search for everything where digits 3-4 are "02" or some such -- char would be simpler and probably faster.
I gather you're not doing arithmetic on them. You're not adding two TACs together, or finding the average TAC for a set of records, or anything like that. If you were, that would be a slam-dunk argument for using int.
If they have leading zeros, it's probably easier to use char so you don't have to pad the number with zeros to the correct length every time.
If none of the above applies, it doesn't matter much. I'd probably use char. I can't think of a compelling reason to go either way.
Stick to INT for this one, DEFINITELY INT (or BIGINT).
Have a look at int, bigint, smallint, and tinyint (Transact-SQL)
int: whole-number data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647). Storage size is 4 bytes.
bigint: whole-number data from -2^63 (-9,223,372,036,854,775,808) through 2^63 - 1 (9,223,372,036,854,775,807). Storage size is 8 bytes.
compared to
char and varchar
Fixed-length non-Unicode character data with a length of n bytes. n must be a value from 1 through 8,000. Storage size is n bytes.
Also, once you query against this, you will see degraded performance if you compare int values against a char column, as SQL Server will have to do a cast for you...
I am using character varying data type in PostgreSQL.
I was not able to find this information in the PostgreSQL manual.
What is max limit of characters in character varying data type?
Referring to the documentation, there is no explicit limit given for the varchar(n) type definition. But:
...
In any case, the longest possible character string that can be stored is about 1 GB. (The maximum value that will be allowed for n in the data type declaration is less than that. It wouldn't be very useful to change this because with multibyte character encodings the number of characters and bytes can be quite different anyway. If you desire to store long strings with no specific upper limit, use text or character varying without a length specifier, rather than making up an arbitrary length limit.)
Also note this:
Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
From documentation:
In any case, the longest possible character string that can be stored is about 1 GB.
Character types in PostgreSQL:
character varying(n), varchar(n) = variable-length with limit
character(n), char(n) = fixed-length, blank padded
text = variable unlimited length
Based on your problem, I suggest you use the text type, which does not require a character length.
In addition, PostgreSQL provides the text type, which stores strings of any length. Although the type text is not in the SQL standard, several other SQL database management systems have it as well.
source : https://www.postgresql.org/docs/9.6/static/datatype-character.html
The maximum string size is about 1 GB. Per the postgres docs:
Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB. (The maximum value that will be allowed for n in the data type declaration is less than that.)
Note that the max n you can specify for varchar is less than the max storage size. While this limit may vary, a quick check reveals that the limit on postgres 11.2 is 10 MB:
psql (11.2)
=> create table varchar_test (name varchar(1073741824));
ERROR: length for type varchar cannot exceed 10485760
Practically speaking, when you do not have a well-rationalized length limit, it's suggested that you simply use varchar without specifying one. Per the official docs:
If you desire to store long strings with no specific upper limit, use text or character varying without a length specifier, rather than making up an arbitrary length limit.
I have a form that records a student ID number. Some of those numbers contain a leading zero. When the number gets recorded into the database it drops the leading 0.
The field is set up to only accept numbers. The length of the student ID varies.
I need the field to be recorded and displayed with the leading zero.
If you are always going to have a number of a certain length (say, it will always be 10 characters), then you can just get the length of the number in the database (after it is converted to a string) and then add the appropriate 0's.
However, if this is an arbitrary amount of leading zeros, then you will have to store the content as a string in the database so you can capture the leading zeros.
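For the fixed-length case, the padding step is trivial; here's a sketch in Python (the 10-character display width is just an example, not from the question):

```python
def display_id(n: int, width: int = 10) -> str:
    # Pad the stored number back out with leading zeros for display
    return str(n).zfill(width)

assert display_id(12345) == "0000012345"
```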
It sounds like this should be stored as string data; the leading zeros are part of the data itself, not just part of its formatting.
You could reformat the data for display with the leading zeros, but I believe you should store the correct form of the ID number; it will lead to fewer bugs down the road (e.g. you format it in one place but forget to in another).
There are a few ways of doing this - depending on the answers to my comments in your question:
Store the extra data in the database by converting the datatype from numeric to varchar/string.
Advantages: Very simple in its implementation; You can treat all the values in the same way.
Disadvantage: If you've got very large amounts of data, storage sizes will escalate; indexing and sorting on strings doesn't perform so well.
Use if: Each number may have an arbitrary length (and hence number of zeros).
Don't use if: You're going to be spending a lot of time sorting data; sorting numeric strings is a pain in the ass - look up natural sorting to see some of the pitfalls.
Continue to store the data in the database as numeric but pad the numeric back to a set length (i.e. 10 as I have suggested in my example below):
Advantages: Data will index better, search better, not require such large amounts of storage if you've got large amounts of data.
Disadvantage: Every query or display of data will require every data instance to be padded to the correct length causing a slight performance hit.
Use if: All the output numbers will be the same length (i.e. including zeros they're all [for example] 10 digits); Large amounts of sorting will be necessary.
Add a field to your table to store the original length of the numeric, and continue to store the value as a numeric (to leverage the sorting/indexing performance gains of numeric vs. string). In the new field, store the length as it would be including the significant zeros:
Advantages: Reduction in required storage space; maximum use of indexing; sorting of numerics is far easier than sorting text numerics; You still get the ability to pad numerics to arbitrary lengths like you have with option 1.
Disadvantages: An extra field is required in your database, so all your queries will have to pull that extra field thus potentially requiring a slight increase in resources at query/display time.
Use if: Storage space/indexing/sorting performance is any sort of concern.
Don't use if: You don't have the luxury of changing the table structure to include the extra value; This will overcomplicate already complex queries.
If I were you and I had access to modify the db structure slightly, I'd go with option 3; sure, you need to pull out an extra field to get the length, but the slightly increased complexity pays huge dividends in the advantages versus the disadvantages. The performance hit of padding the string back out to the correct length will be far superseded by the performance increase in indexing and the storage space saved.
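Option 3 can be sketched in a few lines of Python (hypothetical helper names, just to show the round trip of value plus stored length):

```python
def store(id_string: str):
    # Keep the numeric value plus the original length (option 3)
    return int(id_string), len(id_string)

def display(value: int, length: int) -> str:
    # Reconstruct the zero-padded form for reports
    return str(value).zfill(length)

value, length = store("00734")
assert (value, length) == (734, 5)
assert display(value, length) == "00734"
```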
I worked with a database with a similar problem. They were storing zip codes as a number. The consequence was that people in New Jersey couldn't use our app.
You're using data that is logically a text string and not a number. It just happens to look like a number, but you really need to treat it as text. Use a text-oriented data type, or at least create a database view that enables you to pull back a properly formatted value for this.
See here: Pad or remove leading zeroes from numbers
declare @recordNumber integer;
set @recordNumber = 93088;

declare @padZeroes integer;
set @padZeroes = 8;

select right(replicate('0', @padZeroes) + convert(varchar, @recordNumber), @padZeroes);
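The same RIGHT/REPLICATE padding trick, sketched in Python for comparison:

```python
record_number = 93088
pad_zeroes = 8

# Prepend a run of zeros, then keep only the rightmost pad_zeroes characters,
# mirroring RIGHT(REPLICATE('0', @padZeroes) + CONVERT(varchar, @recordNumber), @padZeroes)
padded = ("0" * pad_zeroes + str(record_number))[-pad_zeroes:]
assert padded == "00093088"
```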
Unless you intend to do calculations on that ID, it's probably best to store them as text/strings.
Another option: since the field is an ID, I would recommend creating a secondary display-number field (nvarchar) that you can use for reports, etc.
Then in your application, when the student ID is entered, you can insert it into the database as the number, as well as the display number.
An Oracle solution
Store the ID as a number and convert it into a character for display. For instance, to display 42 as a zero-padded, three-character string:
SQL> select to_char(42, '099') from dual;
042
Change the format string to fit your needs.
(I don't know if this is transferable to other SQL flavors, however.)
You could just concatenate '1' to the beginning of the ID when storing it in the database. When retrieving it, treat it as a string and remove the first character.
MySQL Example:
SET @student_id = '123456789';
INSERT INTO student_table (id, name) VALUES (CONCAT('1', @student_id), 'John Smith');
...
SELECT SUBSTRING(id, 2) FROM student_table;
Mathematically:
Initially I overthought it and did it mathematically, by adding an integer to the student ID depending on its length (e.g. 1,000,000,000 if it's 9 digits) before storing it.
SET @new_student_id = ABS(@student_id) + POW(10, CHAR_LENGTH(@student_id));
INSERT INTO student_table (id, name) VALUES (@new_student_id, 'John Smith');
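The round trip of the prepend-'1' trick can be sketched in Python (illustration only; the actual storage happens in MySQL):

```python
def encode(student_id: str) -> int:
    # Prepend '1' so leading zeros survive storage as a number
    return int("1" + student_id)

def decode(stored: int) -> str:
    # Drop the sentinel first digit on the way out
    return str(stored)[1:]

assert encode("0123456789") == 10123456789
assert decode(10123456789) == "0123456789"
```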