Changing char(8) to char(32) in SQL Server

We have a table T which contains several (implicitly declared) char(8) columns which, under some conditions, need to be changed into something like char(64).
We don't want to waste space, so here is the question:
Is extending a column's data type an expensive operation from the RDBMS's computational point of view? We'd like to have this answered theoretically, no benchmarks. Does the database need to rearrange the physical layout of the table because of this?

Yes, quite expensive - every single row in that table must be touched, modified, stored again, and all the non-clustered indices using any of those columns will need to be rebuilt.
Since it's CHAR(x), it's fixed-width, so changing its size means every single row has to be modified. Also: with the change from 8 to 64 characters, there's a chance that some pages won't be able to hold all their rows anymore, and page splits with all their overhead will occur.
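A minimal sketch of what that change looks like (the table dbo.T and column SomeCode are invented names for illustration; any nonclustered index on the column would have to be dropped first and recreated afterwards):

-- page count before the change
select in_row_data_page_count
from sys.dm_db_partition_stats
where object_id = object_id('dbo.T');

-- widening the fixed-width column rewrites every row
-- (re-state NULL / NOT NULL as appropriate for the column)
alter table dbo.T alter column SomeCode char(64);

-- run the page-count query again; the table will have grown
-- even though the stored values themselves did not change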

If you increase the length of a char column, data page splits may be required, which is costly in terms of CPU use and I/O, and it also affects memory use and data compression.

Related

Why does PostgreSQL (timescaledb) use more storage for a table than expected?

I'm new to databases. Recently I started using timescaledb, which is an extension of PostgreSQL, so I guess this is also PostgreSQL related.
I observed a strange behavior. I worked out my table structure: 1 timestamp and 2 doubles, so 24 bytes per row in total. I imported (via psycopg2 copy_from) 2,750,182 rows from a CSV file. By my manual calculation the size should be about 63 MB, but when I query timescaledb it tells me the table size is 137 MB, the index size is 100 MB, and the total is 237 MB. I was expecting the table size to equal my calculation, but it doesn't. Any idea?
There are two basic reasons your table is bigger than you expect:
1. Per tuple overhead in Postgres
2. Index size
Per tuple overhead: An answer to a related question goes into detail that I won't repeat here, but basically Postgres uses 23 (+ padding) bytes per row for various internal things, mostly multi-version concurrency control (MVCC) management (Bruce Momjian has some good intros if you want more info). That overhead gets you pretty darn close to the 137 MB you are seeing: roughly 24 bytes of data + 24 bytes of padded tuple header + a 4-byte line pointer is about 52 bytes per row, and 52 bytes × 2,750,182 rows ≈ 136 MB. The rest might be because of either the fill factor setting of the table or any dead rows still included in the table from, say, a previous insert and subsequent delete.
Index Size: Unlike some other DBMSs Postgres does not organize its tables on disk around an index, unless you manually cluster the table on an index, and even then it will not maintain the clustering over time (see https://www.postgresql.org/docs/10/static/sql-cluster.html). Rather it keeps its indices separately, which is why there is extra space for your index. If on-disk size is really important to you and you aren't using your index for, say, uniqueness constraint enforcement, you might consider a BRIN index, especially if your data is going in with some ordering (see https://www.postgresql.org/docs/10/static/brin-intro.html).
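If you want to see where the space goes yourself, here is a small sketch using PostgreSQL's built-in size functions (the table name conditions is just an example; note that with a timescaledb hypertable the data actually lives in child chunk tables, so timescaledb's own size helpers may be more informative):

-- heap (table) size only, excluding indexes and TOAST
select pg_size_pretty(pg_relation_size('conditions'));

-- size of all indexes on the table
select pg_size_pretty(pg_indexes_size('conditions'));

-- everything: heap + indexes + TOAST
select pg_size_pretty(pg_total_relation_size('conditions'));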

Can the size of a column in the database slow a query?

I have a table with a column that contains HTML content and is relatively larger than the other columns.
Can having a column with a large size slow the queries on this table?
Do I need to put this big field in another table?
The TOAST technique should handle this for you: after a given size, the value will be transparently stored in a _toast table, and some internal things are done to avoid slowing down your queries (see the PostgreSQL documentation on TOAST).
But of course if you always retrieve the whole content you'll lose time in the retrieval. And it's also clear that queries on this table which don't use this column won't suffer from its size.
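To check whether the big values are actually being moved out of line, a rough sketch using standard catalog functions (the table pages and column html_content are made-up names for illustration):

-- on-disk size of individual values (after compression / TOAST)
select pg_column_size(html_content) from pages limit 10;

-- main heap size vs. the size of its associated TOAST table
select pg_size_pretty(pg_relation_size('pages'))       as heap_size,
       pg_size_pretty(pg_relation_size(reltoastrelid)) as toast_size
from pg_class
where relname = 'pages' and reltoastrelid <> 0;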
The bigger the database the slower the queries. Always.
It's likely that if you have a large column, there is going to be more disk I/O, since caching the column itself takes more space. However, putting it in a different table won't likely alleviate this issue (other than the issue below). When you don't explicitly need the actual HTML data, be sure not to SELECT it.
Sometimes the ordering of the columns can matter because of the way rows are stored. If you're really worried about it, store it as the last column so it doesn't get paged in when selecting other columns.
You would have to look at how Postgres internally stores things to see if you need to split this out, but a very large field can cause the data to be broken up on disk, which adds to the time it takes to access it.
Further, returning 100 bytes of data versus 10,000 bytes of data for one record is clearly going to be slower, and the more records, the slower. If you are doing SELECT * this is clearly a problem, especially if you usually do not need the HTML.
Another consideration could be putting the HTML information in a NoSQL database; this kind of document information is what they excel at. There is no reason you can't use both a relational database for some info and a NoSQL database for other info.

How are varchar values stored in a SQL Server database?

My fellow programmer has a strange requirement from his team leader; he insisted on creating varchar columns with a length of 16*2^n.
What is the point of such restriction?
I can suppose that short strings (less than 128 chars, for example) are stored directly in the record of the table, and from this point of view the restriction would help to align fields in the record, while larger strings are stored in the database "heap" and only a reference to the string is saved in the table record.
Is it so?
Does this requirement have a reasonable background?
BTW, the DBMS is SQL Server 2008.
Completely pointless restriction as far as I can see. Assuming the standard FixedVar format (as opposed to the formats used with row/page compression or sparse columns), and assuming you are talking about varchar(1-8000) columns:
All varchar data is stored at the end of the row in a variable-length section (or in off-row pages if it can't fit in row). The amount of space it consumes in that section (and whether or not it ends up off row) is entirely dependent upon the length of the actual data, not the column declaration.
SQL Server will use the length declared in the column declaration when allocating memory (e.g. for sort operations). The assumption it makes in that instance is that varchar columns will be filled to 50% of their declared size on average so this might be a better thing to look at when choosing a size.
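A quick way to convince yourself that the storage follows the data rather than the declaration (the table and column names below are invented for the illustration):

-- two tables, identical data, very different declared lengths
create table dbo.Narrow (name varchar(16));
create table dbo.Wide   (name varchar(8000));

insert into dbo.Narrow values ('hello');
insert into dbo.Wide   values ('hello');

-- both report 5 bytes of stored data for the value
select datalength(name) from dbo.Narrow;
select datalength(name) from dbo.Wide;

The declared length still matters for the memory-grant estimate mentioned above, which is a better reason to keep it realistic.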
I have heard of this practice before, but after researching this question a bit I don't think there is a practical reason for having varchar values in multiples of 16. I think this requirement probably comes from trying to optimize the space used on each page. In SQL Server, pages are set at 8 KB per page. Rows are stored in pages, so perhaps the thinking is that you could conserve space on the pages if the size of each row divided evenly into 8 KB (a more detailed description of how SQL Server stores data can be found here). However, since the amount of space used by a varchar field is determined by its actual content, I don't see how using lengths in multiples of 16 or any other scheme could help you optimize the amount of space used by each row on the page. The length of the varchar fields should just be set to whatever the business requirements dictate.
Additionally, this question covers similar ground and the conclusion also seems to be the same:
Database column sizes for character based data
You should always store the data in the data size that matches the data being stored. It is part of how the database can maintain integrity. For instance, suppose you are storing email addresses: if your data size is the size of the maximum allowable email address, then you will not be able to store bad data that is larger than that. That is a good thing. Some people want to make everything nvarchar(max) or varchar(max), but this only causes problems, indexing among them (max columns cannot be used as index key columns).
Personally I would have gone back to the person who made this requirement and asked for a reason, and then presented my reasons as to why it might not be a good idea. I would never just blindly implement something like this. In pushing back on a requirement like this, I would first do some research into how SQL Server organizes data on the disk, so I could show the impact the requirement is likely to have on performance. I might even be surprised to find out the requirement made sense, but I doubt it at this point.

Does the order of columns in the table matter?

We have a number of projects big and small - most (if not all) of them use at least one SQL Server DB. All of them have different environments set up. Typically: dev (1+), QA, UAT, Live.
It is also common for us to release various code updates to different environments independently of each other. Naturally some of those updates come with schema update scripts such as
alter table foo add bar int -- column type here just for illustration
go
update foo set bar=... where ...
Sometimes made by hand, other times using Red Gate SQL/Data Compare.
Anyway, where I'm going with this is that different environments for the same project often end up with a different order of columns. Is this a problem? I don't really know...
Does column order have any performance implications? Anything I could be missing?
No, column order is not significant. In actuality, the order in which column data is stored on disk may itself be different from the order you see in client tools, as the engine reorders the data to optimize storage space and read/write performance (putting multiple bit fields into a single memory location, aligning columns on memory boundaries, etc.).
Not really - in 95% of your cases, there's no difference in the column ordering. And from a relational theoretical point of view, column order in a table is irrelevant anyway.
There are a few edge cases where column order might have a slight impact on your table, most often when you have a large number of variable-size fields (like VARCHAR). But that number needs to be really large, and your fields need to be really massive; in such a case, it can be beneficial to put those variable-size fields at the end of the table in terms of column ordering.
But again: that's really more of a rare edge case, rather than the norm.
Also, mind you: SQL Server has no means of reordering columns, really. You can do that in the visual table designer - but what SQL Server does under the covers is create a new table with the desired column ordering, and then all the data from the old table is copied over. That's the reason this is a very tedious and time consuming operation for large tables.
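If you want to check whether two environments have drifted apart, a small sketch you can run against each database (the table name foo is just the example from above) that lists columns in their ordinal order:

select ordinal_position, column_name, data_type
from information_schema.columns
where table_name = 'foo'
order by ordinal_position;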

Performance implication on nvarchar(4000)?

I have a column that is declared as nvarchar(4000), and 4000 is SQL Server's limit on the length of nvarchar. It would not be used as a key to sort rows.
Are there any implications that I should be aware of before setting the length of the nvarchar field to 4000?
Update
I'm storing XML, an XML-serialized object to be exact. I know this isn't ideal, but we will most likely never need to perform queries on it, and this implementation dramatically decreases development time for certain features that we plan on extending. I expect the XML data to be 1500 characters long on average, but there can be exceptions where it is longer than 4000. Could it be longer than 4000 characters? It could be, but only on very rare occasions, if it ever happens. Is this application mission critical? Nope, not at all.
SQL Server has three types of storage: in-row, LOB and Row-Overflow, see Table and Index Organization. The in-row storage is fastest to access. LOB and Row-Overflow are similar to each other, both slightly slower than in-row.
If you have a column of NVARCHAR(4000) it will be stored in row if possible; if not, it will be stored in the row-overflow storage. Having such a column does not necessarily indicate future performance problems, but it begs the question: why nvarchar(4000)? Is your data likely to always be near 4000 characters long? Can it be 4001, and how will your application handle it in that case? Why not nvarchar(max)? Have you measured performance and found that nvarchar(max) is too slow for you?
My recommendation would be to either use a small nvarchar length, appropriate for the real data, or nvarchar(max) if it is expected to be large. nvarchar(4000) smells like unjustified and untested premature optimisation.
Update
For XML, use the XML data type. It has many advantages over varchar or nvarchar, like the fact that it supports XML indexes and XML methods, and it can actually validate the XML for compliance with a specific schema, or at least check that it is well-formed.
XML will be stored in the LOB storage, outside the row.
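A minimal sketch of what that looks like (table, column and element names are just examples); an xml column can be queried with the built-in XML methods and indexed:

-- xml-typed column instead of nvarchar(4000)
create table dbo.Settings (
    id int not null primary key,
    payload xml not null
);

-- query into the document with the .value() method
select payload.value('(/config/timeout)[1]', 'int') as timeout_sec
from dbo.Settings;

-- a primary XML index speeds up such queries
-- (it requires the clustered primary key declared above)
create primary xml index ix_settings_payload on dbo.Settings (payload);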
Even if the data is not XML, I would still recommend LOB storage (nvarchar(max)) for something with a length around 1500. There is a cost associated with retrieving LOB-stored data, but the cost is more than compensated for by making the table narrower. The width of a table row is a primary factor in performance, because wider tables fit fewer rows per page, so any operation that has to scan a range of rows or the entire table needs to fetch more pages into memory, and this shows up in the query cost (it is actually the driving factor of the overall cost). A LOB-stored column only expands the size of the row by the width of a 'page id', which is 8 bytes if I remember correctly, so you can get much better density of rows per page, and hence faster queries.
Are you sure that you'll actually need as many as 4000 characters? If 1000 is the practical upper limit, why not set it to that? Conversely, if you're likely to get more than 4000 characters, you'll want to look at nvarchar(max).
I like to "encourage" users not to use storage space too freely. The more space required to store a given row, the fewer rows you can store per page, which potentially results in more disk I/O when the table is read or written to. Even though only as many bytes of data are stored as are necessary (i.e. not the full 4000 per row), once you get a bit more than 2000 characters of nvarchar data, you'll only fit one row per page, and performance can really suffer.
This of course assumes you need to store unicode (double-byte) data, and that you only have one such column per row. If you don't, drop down to varchar.
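To see how densely rows actually pack, here is a sketch (dbo.Docs is an invented name) using the physical-stats DMV; record count divided by page count gives the rows-per-page figure discussed above:

select record_count, page_count,
       record_count / nullif(page_count, 0) as rows_per_page,
       avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('dbo.Docs'), null, null, 'DETAILED');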
Do you definitely need nvarchar, or can you go with varchar? The limitation applies mainly to SQL Server 2000; are you using 2005 / 2008 (where nvarchar(max) is available)?
