PostgreSQL allocated column length - database

I have an issue where I have one column in a database that might be anything from 10 to 10,000 bytes in size. Do you know if PostgreSQL supports sparse data (i.e. will it always set aside the 10,000 bytes for every entry in the column ... or only the space that is required for each entry)?

Postgres will store long varlena types (such as text and varchar) in an extended storage area called TOAST.
In the case of strings, it keeps short values (up to 126 bytes, which can mean fewer than 126 characters for multibyte encodings) with a compact 1-byte header; values stay inline until the row grows too large (roughly 2 KB), at which point Postgres compresses them and/or moves them out to TOAST storage.
You can see each column's storage strategy (plain, main, extended) using psql:
\d+ yourtable
As an aside, note that from Postgres' standpoint, there's absolutely no difference (with respect to storage) between declaring a column's type as varchar or varchar(large_number) -- it'll be stored in the exact same way. There is, however, a very slight performance penalty in using varchar(large_number) because of the string length check.
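A minimal sketch of that point (table and column names are made up): pg_column_size shows that the space used depends on the value, not on the declared length.

```sql
-- Hypothetical table: varchar(n) and unbounded varchar store identically
CREATE TABLE demo (
    short_col  varchar(10000),
    unbounded  varchar
);

INSERT INTO demo VALUES ('hello', 'hello');

-- pg_column_size reports the bytes actually used for each value
SELECT pg_column_size(short_col) AS short_bytes,
       pg_column_size(unbounded) AS unbounded_bytes
FROM demo;
-- Both columns report the same small size; neither reserves 10,000 bytes.
```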

use varchar or text types - these use only the space actually required to store the data (plus a small overhead to record the length: 1 byte for values up to 126 bytes, 4 bytes for longer ones)

Related

Is there much to be gained by having a column which might hold up to ten characters as a VARCHAR?

I have a simple Id column in my database. It can contain information like U001-01, or perhaps something a little longer later on.
I am thinking it will be about ten characters and I would like to have an index on this column.
Is there really much to be gained by having this as a VARCHAR(10) instead of a CHAR(10)? Note that my rows will already be over 1000 bytes long.
In general, I would recommend using varchar() instead of char(), unless you really want your values padded with spaces at the end. Spaces can make it cumbersome to combine the field with other fields using concatenation. It can also get confusing to remember whether and when the extra spaces matter for comparison purposes.
The additional two bytes of overhead is usually insignificant. After all, if your average value length is less than 8 (that is, n − 2 for n = 10), then the overall storage is still less with a varchar() than a char() representation.
In general, I default to varchar(). If I know a coding is fixed length (US state codes, ISO country codes, 9-digit US zip codes, social security numbers), then I will consider a char() instead.
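To illustrate the padding point, here is a T-SQL sketch (variable names are made up); note that in PostgreSQL, casting char to text strips the padding, so the behavior differs across engines:

```sql
DECLARE @c char(10)    = 'U001-01';
DECLARE @v varchar(10) = 'U001-01';

-- The char value is padded to 10 characters, and the trailing
-- spaces survive concatenation; the varchar value is not padded.
SELECT @c + '|' AS char_concat,    -- 'U001-01   |'
       @v + '|' AS varchar_concat; -- 'U001-01|'
```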
As my unfinished comment suggests, CHAR uses static (fixed-length) allocation and so can be marginally more efficient than VARCHAR, which uses dynamic allocation. CHAR columns are space-padded, so this must be considered when performing comparisons.
The index is effective whether it is VARCHAR(10) or CHAR(10).
There is not a lot to save on 10 characters from a size perspective.
Varchar has 2 bytes of overhead per value.
Char reserves the space up front, so changing or later inserting the value will not cause page splits (fragmentation).
A page split takes time, and fragmentation slows performance of the table.
I typically use char for lengths of 40 and under, just to avoid page splits.
char and varchar (Transact-SQL)
If you use char or varchar, we recommend the following:
Use char when the sizes of the column data entries are consistent.
Use varchar when the sizes of the column data entries vary
considerably.
Use varchar(max) when the sizes of the column data entries vary
considerably, and the size might exceed 8,000 bytes.
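Following that guidance, a small illustrative table (column names are hypothetical):

```sql
CREATE TABLE customer (
    state_code char(2),       -- consistent size: char
    city       varchar(100),  -- sizes vary considerably: varchar
    notes      varchar(max)   -- varies and may exceed 8,000 bytes
);
```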

Why specify a length for character varying types

Referring to the Postgres Documentation on Character Types, I am unclear on the point of specifying a length for character varying (varchar) types.
Assumption:
the length of the string doesn't matter to the application.
you don't care if someone stores a maximum-size value in the database
you have unlimited hard disk space
It does mention:
The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case
of character. Longer strings have 4 bytes of overhead instead of 1.
Long strings are compressed by the system automatically, so the
physical requirement on disk might be less. Very long values are also
stored in background tables so that they do not interfere with rapid
access to shorter column values. In any case, the longest possible
character string that can be stored is about 1 GB. (The maximum value
that will be allowed for n in the data type declaration is less than
that. It wouldn't be useful to change this because with multibyte
character encodings the number of characters and bytes can be quite
different.)
This talks about the size of the string, not the size of the field (i.e. it sounds like it will always compress a large string in a large varchar field, but not a small string in a large varchar field?)
I ask this question as it would be much easier (and lazier) to specify a much larger size so you never have to worry about having a string too large. For example, if I specify varchar(50) for a place name I will get locations that have more characters (e.g. Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch), but if I specify varchar(100) or varchar(500), I'm less likely to get that problem.
So would you get a performance hit between varchar(500) and (arbitrarily) varchar(5000000) or text if your largest string was, say, 400 characters long?
Also out of interest if anyone has the answer to this AND knows the answer to this for other databases, please add that too.
I have googled, but not found a sufficiently technical explanation.
My understanding is that having constraints is useful for data integrity, therefore I use column sizes to both validate the data items at the lower layer, and to better describe the data model.
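A Postgres sketch of that idea (table and column names are made up): the limit can live in the type itself or in an explicit CHECK constraint on text; either way it validates data at the lowest layer and documents the model.

```sql
-- The limit enforced by the type itself
CREATE TABLE place_v (
    name varchar(50)
);

-- Equivalent integrity with text plus a named CHECK constraint,
-- which also self-documents the intended maximum length
CREATE TABLE place_t (
    name text CONSTRAINT name_max_50 CHECK (char_length(name) <= 50)
);

-- Either way, an over-long insert is rejected:
-- INSERT INTO place_v VALUES (repeat('x', 51));  -- fails
```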
Some links on the matter:
VARCHAR(n) Considered Harmful
CHAR(x) vs. VARCHAR(x) vs. VARCHAR vs. TEXT
In Defense of varchar(x)
My understanding is that this is a legacy of older databases with storage that wasn't as flexible as that of Postgres. Some would use fixed-length structures to make it easy to find particular records and, since SQL is a somewhat standardized language, that legacy is still seen even when it doesn't provide any practical benefit.
Thus, your "make it big" approach should be entirely reasonable with Postgres, but it may not transfer well to other, less flexible RDBMSs.
The documentation explains this:
If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension.
The SQL standard requires a length specification for all its types. This is probably mainly for legacy reasons. Among PostgreSQL users, the preference tends to be to omit the length specification, but if you want to write portable code, you have to include it (and pick an arbitrary size, in many cases).
Two more thoughts:
The Postgres doc says that 'very long values are also stored in background tables'. Thus, defining all strings as unbounded likely pushes them into background tables -- for sure a performance hit.
Declaring everything as very long interferes with the DB's efforts to predict a query execution plan, because it has less knowledge of the data.
Building a b-tree for an index would also be thrown off, because the system could not guess a reasonable packing strategy. For example, if gender were TEXT, how would it know the values are only ever M or F?

How are varchar values stored in a SQL Server database?

My fellow programmer has a strange requirement from his team leader; he insisted on creating varchar columns with lengths of 16·2^n (16, 32, 64, ...).
What is the point of such restriction?
I can suppose that short strings (less than 128 chars, for example) are stored directly in the record of the table, and from this point of view the restriction would help to align fields in the record; larger strings are stored in the database "heap" and only a reference to the string is saved in the table record.
Is it so?
Does this requirement have a reasonable background?
BTW, the DBMS is SQL Server 2008.
Completely pointless restriction as far as I can see. Assuming standard FixedVar format (as opposed to the formats used with row/page compression or sparse columns) and assuming you are talking about varchar(1-8000) columns
All varchar data is stored at the end of the row in a variable-length section (or in off-row pages if it can't fit in the row). The amount of space it consumes in that section (and whether or not it ends up off-row) is entirely dependent upon the length of the actual data, not the column declaration.
SQL Server will, however, use the length declared in the column definition when allocating memory (e.g. for sort operations). The assumption it makes in that instance is that varchar columns will be filled to 50% of their declared size on average, so this may be a better consideration when choosing a size.
I have heard of this practice before, but after researching this question a bit I don't think there is a practical reason for having varchar values in multiples of 16. I think this requirement probably comes from trying to optimize the space used on each page. In SQL Server, pages are set at 8 KB per page. Rows are stored in pages, so perhaps the thinking is that you could conserve space on the pages if the size of each row divided evenly into 8 KB (a more detailed description of how SQL Server stores data can be found here). However, since the amount of space used by a varchar field is determined by its actual content, I don't see how using lengths in multiples of 16 or any other scheme could help you optimize the amount of space used by each row on the page. The length of the varchar fields should just be set to whatever the business requirements dictate.
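A small T-SQL sketch of the point above (table and column names are made up): the declared length does not change the bytes actually stored.

```sql
-- Hypothetical table with two very different declared lengths
CREATE TABLE t (
    a varchar(16),
    b varchar(8000)
);

INSERT INTO t VALUES ('hello', 'hello');

-- DATALENGTH returns the bytes actually stored for each value
SELECT DATALENGTH(a) AS a_bytes,  -- 5
       DATALENGTH(b) AS b_bytes   -- 5
FROM t;
```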
Additionally, this question covers similar ground and the conclusion also seems to be the same:
Database column sizes for character based data
You should always store the data in a data size that matches the data being stored. It is part of how the database maintains integrity. For instance, suppose you are storing email addresses. If your data size is the maximum allowable size of an email address, then you will not be able to store bad data that is larger than that. That is a good thing. Some people want to make everything nvarchar(max) or varchar(max); however, that causes problems with indexing, among other things.
Personally, I would have gone back to the person who made this requirement and asked for a reason, then presented my reasons as to why it might not be a good idea. I would never just blindly implement something like this. In pushing back on a requirement like this, I would first do some research into how SQL Server organizes data on the disk, so I could show the impact the requirement is likely to have on performance. I might even be surprised to find out the requirement made sense, but I doubt it at this point.

Should Data types be sizes of powers of 2 in SQL Server?

What are good sizes for data types in SQL Server? When defining columns, I see data types with sizes of 50 as one of the defaults (e.g. nvarchar(50), binary(50)). What is the significance of 50? I'm tempted to use sizes that are powers of 2 - is that better, or just pointless?
Update 1
Alright thanks for your input guys. I just wanted to know the best way of defining the size of a datatype for a column.
There is no reason to use powers of 2 for performance etc. Data length should be determined by the size of the stored data.
Why not the traditional powers of 2, minus 1 such as 255...
Seriously, the length should match what you need and is suitable for your data.
Nothing else: how the client uses it, aligns to 32 bit word boundary, powers of 2, birthdays, Scorpio rising in Uranus, roll of dice...
The reason so many fields have a length of 50 is that SQL Server defaults to 50 as the length for most data types where length is an issue.
As has been said, the length of a field should be appropriate to the data that is being stored there, not least because there is a limit to the length of a single record in SQL Server (~8,000 bytes), and it is possible to blow past that limit.
Also, the length of your fields can be considered part of your documentation. I don't know how many times I've met lazy programmers who claim that they don't need to document because the code is self documenting and then they don't bother doing the things that would make the code self documenting.
You won't gain anything from using powers of 2. Make the fields as long as your business needs really require them to be - let SQL Server handle the rest.
Also, since the SQL Server page size is limited to 8K (of which 8060 bytes are available to user data), making your variable length strings as small as possible (but as long as needed, from a requirements perspective) is a plus.
That 8K limit is a fixed SQL Server system setting which cannot be changed.
Of course, SQL Server these days can handle more than 8K of data in a row, using so-called "overflow" pages - but it's less efficient, so trying to stay within 8K is generally a good idea.
Marc
The size of a field should be appropriate for the data you are planning to store there, global defaults are not a good idea.
It's a good idea for the whole row to fit into a page several times over without leaving too much free space.
A row cannot span two pages, and a page has 8,096 bytes of free space, so two rows that take 4,049 bytes each will occupy two pages.
See docs on how to calculate the space occupied by one row.
Also note that the VAR in VARCHAR and VARBINARY stands for "varying", so if you put a 1-byte value into a 50-byte column, it will take only 1 byte (plus a small per-column length overhead).
This totally depends on what you are storing.
If you need x chars use x not some arbitrarily predefined amount.

Strategy for storing an string of unspecified length in Sql Server?

So a column will hold some text, and beforehand I won't know how long the string can be. Realistically, 95% of the time it will probably be between 100-500 chars, but there can be that one case where it will be 10,000 chars long. I have no control over the size of this string, and neither does the user. Besides varchar(max), what other strategies have you found useful? Also, what are some cons of varchar(max)?
Varchar(max) in SQL Server 2005 is what I use.
SQL Server handles large string fields oddly, in that if you specify text or a large varchar (but not max), it stores part of the bits in the record and the rest outside.
To my knowledge, with varchar(max) it stores the entire contents out of the record, which makes it less efficient than a small inline value, but more efficient than a text field, since it does not have to look the data up twice by getting part inline and the rest from a pointer.
One inelegant but effective approach would be to have two columns in your table: one a varchar big enough to cover the majority of cases, and another of a CLOB/TEXT type to store the freakishly large ones. When inserting/updating, you can check the size of your string and store it in the appropriate column.
Like I say, not pretty, but it would give you the performance of varchar for the majority case, without breaking when you have larger values.
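A sketch of that two-column approach in T-SQL (all names hypothetical); the application decides which column to fill based on the string's length:

```sql
CREATE TABLE message (
    id         int PRIMARY KEY,
    body_short varchar(500),  -- the common case, stored in-row
    body_long  varchar(max),  -- the rare oversized case
    -- exactly one of the two columns should be populated
    CHECK ((body_short IS NULL AND body_long IS NOT NULL)
        OR (body_short IS NOT NULL AND body_long IS NULL))
);

-- Reading back: take whichever column is populated
SELECT id, COALESCE(body_short, body_long) AS body FROM message;
```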
Have you considered using the BLOB type?
Also, out of curiosity: if you don't control the size of the string, and neither does the user, who does?
nvarchar(max) is definitely your best bet - as I'm sure you know, it only allocates the space required for the data actually stored in each row, not the declared maximum of the datatype.
The only con I would see is if you are constantly updating a row and it often switches between less than 8,000 bytes and more than 8,000 bytes, in which case SQL Server changes the storage to a LOB and stores a pointer to the data whenever you go over 8,000 bytes. Flipping back and forth would be expensive in this case, but you don't really have any other options that I can see - so it's kind of a moot point.
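If that back-and-forth between in-row and off-row storage is a concern, SQL Server can be told to keep large-value types off-row for the whole table (the table name here is hypothetical):

```sql
-- Store all varchar(max)/nvarchar(max) values off-row, leaving only
-- a pointer in the row; this avoids repeated value migration when
-- sizes hover around the in-row threshold
EXEC sp_tableoption 'dbo.message', 'large value types out of row', 1;
```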
