I was reading a book:
For example, when a column is defined as VARCHAR(25), the maximum number of characters supported is 25, but in practice, the actual number of characters in the string determines the amount of storage. Because storage consumption
for these data types is less than that for fixed-length types, read operations are faster. However, updates might result in row expansion, which might result in data movement outside the current page. Therefore, updates of data having variable-length data types are less efficient than updates of data having fixed-length data types.
I can understand that storage consumption for varchar is less than that for char, but why is it slower than char when updating records? What does row expansion mean, and what actually happens when a row expands?
Let's say we have a suburb table which has two columns, zipcode char(5) and name varchar, and let's say we need to update a row so that zipcode is 10005 and name is 'NYC'. We only set 3 characters for the name column; shouldn't that be more efficient than the zipcode column, which requires 5 characters?
Rows are laid out with the fixed size columns first, at fixed offsets from the start of the row. Then (after some important bytes in the middle) the variable sized data is placed at the end. Because it's variable sized, the actual offset to the data cannot be computed for the whole table (like the fixed data) but has to be computed on a row-by-row basis.
And if a varchar(5)¹ is storing NYC and is then asked to store NYCX, it may find that there's not a spare byte at the end of NYC - it's being used for another column - so the row has to expand by moving everything after it one byte further along to make space for the extra byte.
¹ I notice in one of your examples you failed to specify a length. Please drill it into yourself that that's a bad habit.
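If you want to watch this happen, here is a rough sketch (the table and column names are made up; the DMV is standard SQL Server) that packs a heap with short names, grows them, and then counts the forwarded records created by the resulting row expansion:
-- Hypothetical demo table: a heap with one fixed-length and one variable-length column.
CREATE TABLE dbo.suburb_demo (zipcode char(5), name varchar(8000));

INSERT INTO dbo.suburb_demo (zipcode, name)
SELECT '10005', 'NYC' FROM sys.all_objects;        -- a few thousand short rows, tightly packed

UPDATE dbo.suburb_demo
SET name = REPLICATE('X', 2000);                   -- every row grows; many no longer fit on their page

-- Forwarded records are the heap's way of coping with rows that had to move.
SELECT index_type_desc, forwarded_record_count, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.suburb_demo'), NULL, NULL, 'DETAILED');
On a table with a clustered index the same growth shows up as page splits and fragmentation rather than forwarded records.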
I'm using SQL Server 2016 and I have a table in my database whose size is 120 GB. It has 300 columns, all of which are NVARCHAR(MAX), and it has 1,200,000 records. Around 100 of the columns are NULL all the time or hold only a short value. My doubt is why 1,200,000 records take up 120 GB; is it because of the datatype?
This is an audit table that holds CDC historical information. On average, 10,000 records are inserted into it per day. Because of this, my database size is increasing and SQL queries are slow. The table is not used for any queries.
Please let me know the reason why my table is very big.
Of course, it depends on how you are measuring the size of the table and what other operations occur.
You are observing about 100,000 bytes per record. That does seem large, but there are things you need to consider.
NVARCHAR(MAX) has a minimum size:
nvarchar [ ( n | max ) ]
Variable-length Unicode string data. n defines the string length and can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size, in bytes, is two times the actual length of data entered + 2 bytes. The ISO synonyms for nvarchar are national char varying and national character varying.
Even the empty fields occupy 2 bytes plus the nullable flag. With 300 fields, that is 600-plus bytes right there (600 + 300 / 8).
You may also have issues with pages that are only partially filled. This depends on how you insert data, the primary key, and system parameters.
And there are other considerations, depending on how you are measuring the size:
How large are the largest fields?
How often are rows occupying multiple pages (each additional page has additional overhead)?
You are using wide (two-byte) characters, so strings occupy twice as many bytes as their character counts suggest.
Is your estimate including indexes?
If you are measuring database size, you may be including the transaction log.
I would suggest that you have your DBA investigate the table to see if there are any obvious problems, such as many pages that are only partially filled.
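As a starting point for that investigation, something like the following might help (dbo.AuditTable is a placeholder for your table; SAMPLED mode is used because a DETAILED scan of a 120 GB table can take a long time):
-- Page fill and record sizes for a hypothetical dbo.AuditTable.
-- A low avg_page_space_used_in_percent across many pages means partially filled pages.
SELECT index_id,
       page_count,
       avg_page_space_used_in_percent,
       avg_record_size_in_bytes,
       forwarded_record_count            -- populated for heaps only
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.AuditTable'), NULL, NULL, 'SAMPLED');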
Edit: updated answer upon clarification on the number of rows that the table really has.
Taking into account that 120 GB is 120,000 MB, you are getting 100 KB per row, that is, about 330 bytes for each column on average, which is usually quite high, but not for a table with 300 nvarchar(max) columns (note that the nchar and nvarchar types take 2 bytes per char, not 1).
Also, you commented that one of those columns holds 2,000-90,000 characters (!). Supposing that column has on average 46k characters, we get a size of:
1,200,000 rows x 46k chars x 2 byte/char = 105GB only for the data of that column.
That leaves 15 GB for the rest of the columns, or about 13 KB per row, which is 44 bytes per column, quite low taking into account that almost all of them are nvarchar(max).
But those are only estimates; to get the real size of any column, use:
select sum(datalength(ColumnName))/1024.00 as SizeKB from TableName
And all of this only takes the data into account, which is not the whole picture because the database structures need space of their own. For example, indexes add to the total size of a table; roughly, they take the sum of the sizes of the columns included in the index (for example, if you defined an index on the big column, it would take roughly another 100 GB).
You can find out how much space the whole table uses with the following script from another question (it will show the size of each table in the DB):
Get size of all tables in database
Check the column UsedSpaceMB; that is the size needed for the data and the indexes. If for some reason the table is using more space (usually because you deleted data), you will see that size in UnusedSpaceMB (a bit of unused space is normal).
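If you just want a quick per-table summary without the full script, sp_spaceused reports the reserved, data, index and unused space in one call (the table name below is a placeholder):
-- Quick single-table summary: reserved, data, index_size and unused space.
EXEC sp_spaceused N'dbo.AuditTable';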
I have a SQL Server 2008 database that stores millions of rows. There are several NVARCHAR columns that will never exceed the current max length of the column, nor get close to it due to application constraints.
i.e.
The Address NVARCHAR field has a length of 50 characters, but it'll never exceed 32 characters.
Is there a performance benefit or space-saving benefit to reducing the size of the NVARCHAR column to what its actual max length will be (i.e. in the case of the Address field, 32 characters)? Or will it not make a difference, since it's a variable-length field?
Setting the number of characters in NVARCHAR is mainly for validation purposes. If there is some reason why you don't want the data to exceed 50 characters then the database will enforce that rule for you by not allowing extra data.
If the total row size exceeds a threshold then it can affect performance, so by restricting the length you could benefit by not allowing your row size to exceed that threshold. But in your case, that does not seem to matter.
The reason the threshold matters is that SQL Server can fit more rows onto a page when rows are smaller, which results in less disk I/O, and more rows can be stored in memory.
Also, the maximum row size in SQL Server is 8 KB, as that is the size of a page and rows cannot cross page boundaries. If you insert a row that exceeds 8 KB, the extra data will be stored in a row-overflow page, which will likely have a negative effect on performance.
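As a rough illustration of row overflow (all names here are made up), the following creates a row whose two varchar values together exceed the roughly 8,060-byte in-row limit, then lists the table's allocation units, where the off-row portion shows up as ROW_OVERFLOW_DATA:
-- Hypothetical table whose combined variable-length data cannot fit in-row.
CREATE TABLE dbo.overflow_demo (id int, a varchar(5000), b varchar(5000));

INSERT INTO dbo.overflow_demo (id, a, b)
VALUES (1, REPLICATE('a', 5000), REPLICATE('b', 5000));   -- ~10,000 bytes of data: one value is pushed off-row

-- IN_ROW_DATA holds the main record; ROW_OVERFLOW_DATA holds the value that did not fit.
SELECT au.type_desc, au.total_pages, au.used_pages
FROM sys.allocation_units AS au
JOIN sys.partitions AS p ON au.container_id = p.hobt_id
WHERE p.object_id = OBJECT_ID('dbo.overflow_demo');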
There is no expected performance or space saving benefit for reducing your n/var/char column definitions to their maximum length. However, there may be other benefits.
The column won't accidentally have a longer value inserted without generating an error (desirable for the "fail fast" characteristic of well-designed systems).
The column communicates something about the data to the next developer examining the table, which aids understanding. No developer will be confused about the purpose of the data or have to waste time determining whether the code's field validation rules or the column definition is wrong (as they logically should match).
If your column does need to be extended in length, you can do so with potential consequences ascertained in advance. A professional who is well-versed in databases can use the opportunity to see if upcoming values that will need the new column length will have a negative impact on existing rows or on query performance—as the amount of data per row affects the number of reads required to satisfy queries.
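If you do decide to tighten a definition, a minimal sketch of the order of operations might look like this (dbo.MyTable is a placeholder, Address and the length 32 come from the question, and the column is assumed to be nullable):
-- First confirm nothing longer than the new limit already exists.
SELECT MAX(LEN(Address)) AS longest_address FROM dbo.MyTable;    -- expect 32 or less

-- Then shrink the definition, restating the original NULLability.
ALTER TABLE dbo.MyTable ALTER COLUMN Address NVARCHAR(32) NULL;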
I am wondering about memory allocation in a database: does it assign memory based on the data type of the column, or according to the value? Being a .NET developer, I have the notion that memory is allocated based on the data type, not the value. Now my question is how memory allocation is handled on the DB side.
FOR EXAMPLE
| id1  | id2  | id3  | name       |
| NULL | NULL | NULL | James Bond |
id1, id2 and id3 have NULL values. What will be the memory size of this row? Will memory be assigned to the columns having NULL values?
Edit
Database server: SQL Server 2008 R2
Thanks in advance
Physical storage for SQL Server is in a unit called a "page". There are several structures within a page, those structures are called "records". There are several types of records, the type of record you seem to be asking about is called a "data record".
(There are also several other types of records within a page: index records, forwarding records, ghost records, text records, and other internal record structures (allocation bitmaps, file headers, etc.)
To answer your question, without delving into all those details, and neglecting a discussion of "row compression" and "page compression"...
One part of the record is for the "fixed length" columns, where the columns that are defined with fixed-length datatypes are stored (integer, float, date, char(n), etc.). As the name implies, a fixed amount of storage is reserved for each column. Another part of the record is the "variable length" portion, where columns with variable-length datatypes are stored: a two-byte count of the number of variable-length columns, then a two-byte offset to the end of each column's value, followed by the values themselves.
Q: What will be the memory size of this row?
A: In your case, the table with four columns, there will be eight bytes for the record header, some fixed number of bytes for the fixed length columns, three bytes for the NULL bitmap, and a variable amount of storage for the variable length columns.
The "memory size for the row" is really determined by the datatypes of the columns, and for variable length columns, the values that are stored.
(And if any indexes exist, there's also space required in index records.)
Q: Will memory be assigned to the columns having NULL values?
If the columns are fixed length, yes. If columns are variable length, yes, at a minimum, the two byte offset to the end of the value, even if the value is zero length.
SQL Server manages memory in "pages"... In terms of estimating memory requirements, the more pertinent question is "how many rows fit in a page", and "how many pages are required to store my rows?"
A page that contains one data record requires 8 KB of memory. A page that contains a dozen data records also requires 8 KB of memory.
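To see real record sizes rather than reason about them, here is a rough sketch that mirrors the example table (the int datatypes for id1-id3 are an assumption) and reads the sizes back from a standard DMV:
-- Hypothetical version of the example table: three nullable ints and a name.
CREATE TABLE dbo.person_demo (id1 int NULL, id2 int NULL, id3 int NULL, name varchar(50) NULL);

INSERT INTO dbo.person_demo (id1, id2, id3, name) VALUES (NULL, NULL, NULL, 'James Bond');

-- Record size includes the header, the fixed-length columns, the NULL bitmap
-- and the variable-length portion described above.
SELECT min_record_size_in_bytes, avg_record_size_in_bytes, max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.person_demo'), NULL, NULL, 'DETAILED');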
I have found the answer on MSDN
Use Sparse Columns
The SQL Server Database Engine uses the SPARSE keyword in a column definition to optimize the storage of values in that column. Therefore, when the column value is NULL for any row in the table, the values require no storage.
http://msdn.microsoft.com/en-us/library/cc280604.aspx
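A minimal sketch of the syntax (the table is hypothetical; SPARSE requires the column to be nullable and only pays off when the large majority of values really are NULL):
-- NULLs in a SPARSE column take no space, at the cost of extra overhead for non-NULL values.
CREATE TABLE dbo.sparse_demo (
    id   int IDENTITY(1,1) PRIMARY KEY,
    id1  int SPARSE NULL,
    id2  int SPARSE NULL,
    id3  int SPARSE NULL,
    name varchar(50) NULL
);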
I have a simple Id column in my database. It can contain values like U001-01, or perhaps something a little longer later on.
I am thinking it will be about ten characters and I would like to have an index on this column.
Is there really much to be gained by having this as a VARCHAR(10) instead of a CHAR(10)? Note that my rows will already be over 1,000 bytes long.
In general, I would recommend using varchar() instead of char(), unless you really want your values padded with spaces at the end. Spaces can make it cumbersome to combine the field with other fields using concatenation. It can also get confusing to remember whether and when the extra spaces matter for comparison purposes.
The additional two bytes of overhead is usually insignificant. After all, if your average value length is less than 8 (n - 2), then the overall storage is still less with a varchar() versus char() representation.
In general, I default to varchar(). If I know a coding is fixed length (US state codes, ISO country codes, 9-digit US zip codes, social security numbers), then I will consider a char() instead.
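To make the padding point concrete, here is a tiny sketch with local variables (the values are made up):
-- char(10) pads 'NYC' with seven trailing spaces; varchar(10) stores just 'NYC'.
DECLARE @c char(10)    = 'NYC';
DECLARE @v varchar(10) = 'NYC';

SELECT '[' + @c + ']' AS padded_char,       -- '[NYC       ]'
       '[' + @v + ']' AS plain_varchar,     -- '[NYC]'
       CASE WHEN @c = @v THEN 'equal' ELSE 'different' END AS comparison;  -- 'equal': = ignores trailing spaces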
As my unfinished comment suggests, CHAR uses static (fixed-length) storage allocation, so it is more efficient than VARCHAR, which uses dynamic (variable-length) allocation. CHAR columns are space-padded, so this must be considered when performing comparisons.
The index is effective whether the column is VARCHAR(10) or CHAR(10).
There is not a lot to save at a length of 10 from a size perspective.
Varchar has 2 bytes of overhead
Char reserves space so changing or later inserting the value will not cause page splits (fragmentation)
The page split takes time and fragmentation slows performance of the table
I typically use char for lengths of 40 and under, just to avoid page splits.
char and varchar (Transact-SQL)
If you use char or varchar, we recommend the following:
Use char when the sizes of the column data entries are consistent.
Use varchar when the sizes of the column data entries vary considerably.
Use varchar(max) when the sizes of the column data entries vary considerably, and the size might exceed 8,000 bytes.
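Applied to the column in this question, that guidance might look roughly like this (the table and most columns are hypothetical; only the 'U001-01' style ref_id comes from the question):
-- Fixed-size codes as char, genuinely variable text as varchar.
CREATE TABLE dbo.customer_demo (
    state_code char(2)      NOT NULL,   -- always exactly 2 characters: char
    zipcode    char(5)      NOT NULL,   -- always exactly 5 characters: char
    ref_id     varchar(10)  NOT NULL,   -- values like 'U001-01', may grow a little: varchar
    full_name  varchar(100) NOT NULL,   -- length varies considerably: varchar
    notes      varchar(max) NULL        -- can exceed 8,000 bytes: varchar(max)
);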
I have an issue where I have one column in a database that might be anything from 10 to 10,000 bytes in size. Do you know if PostgreSQL supports sparse data (i.e. will it always set aside the 10,000 bytes for every entry in the column ... or only the space that is required for each entry)?
Postgres will store long varlena types in an extended storage called TOAST.
In the case of strings, it keeps things inline up to 126 bytes (which potentially means fewer than 126 characters for multibyte data), and then sends them to the external storage.
You can see where the data is stored using psql:
\dt+ yourtable
As an aside, note that from Postgres' standpoint, there's absolutely no difference (with respect to storage) between declaring a column's type as varchar or varchar(large_number) -- it'll be stored in the exact same way. There is, however, a very slight performance penalty in using varchar(large_number) because of the string length check.
Use varchar or text types - these use only the space actually required to store the data (plus a small per-value overhead of 1 or 4 bytes to store the length).
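To confirm this on your own data, pg_column_size() reports the bytes a stored value actually occupies, after any compression or out-of-line TOAST storage (yourtable and yourcolumn are placeholders):
-- Per-value storage actually used, versus the declared maximum.
SELECT yourcolumn,
       pg_column_size(yourcolumn) AS stored_bytes,
       length(yourcolumn)         AS characters
FROM yourtable
ORDER BY stored_bytes DESC
LIMIT 10;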