I am trying to estimate database size for SQL Server 2008 R2. I have a table with one INTEGER primary key and 39 text columns of type VARCHAR(MAX).
I have searched and found two statements.
A table can contain a maximum of 8,060 bytes per row.
Varchar(max) has a maximum storage capacity of 2 gigabytes.
I am confused to get estimate the size. How can I store 2 gigabytes in each column if there is a limit on row?
I am not database expert may be I am not getting it correctly.
Can anyone explain How to estimate it?
Thank you
In Microsoft SQL Server, data (which includes indexes) are stored in one or more 8k (8192 bytes) "pages". There are different types of pages that can be used to handle various situations (e.g. Data, LOB, Index, AllocationMap, etc) . Each page has a header which is meta-data about that page and what it contains.
Most data is stored in the row itself, and one or more of these rows are in turn stored in a page for "in-row data". Due to the space taken by the row header, the largest a row can be (for "in-row" data) is 8060 bytes.
However, not all data is stored in the row. For certain datatypes, the data can actually be stored on a "LOB data" page while a pointer is left in the "in-row" data:
Legacy / deprecated LOB types that nobody should be using anymore (TEXT, NTEXT, and IMAGE), by default, always store their data on LOB pages and always use a 16-byte pointer to that LOB page.
The newer LOB types (VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX), and XML), by default, will attempt to fit the data directly in the row if it will fit. Else it will store the data on LOB pages and use a pointer of 24 - 72 bytes (depending on the size of the LOB data).
This is how you could store up to 78 GB + 4 bytes (can't forget about the INT Primary Key ;-) in a single row: the max row size will be between 940 bytes ((39 * 24) + 4) and 2812 bytes ((39 * 72) + 4). But again, that is just the maximum range; if the data in each of the 39 VARCHAR(MAX) fields is just 10 bytes, then all of the data will be stored in-row and the row size will be 394 bytes ((39 * 10) + 4).
Given that you have so many variable-length fields (whether they are MAX or not), the only way to estimate the size of future rows is to have a good idea about what data you will be storing in this table. Although, a table with all, or even mostly, MAX datatypes implies that nobody really has any idea what is going to be stored in this table.
Along those lines, it should be pointed out that this is a horribly modeled table / horrible use of MAX datatype fields, and should be refactored.
For more details about how data pages are structured, please see my answer to the following DBA.StackExchange question:
SUM of DATALENGTHs not matching table size from sys.allocation_units
When you use Varchar(MAX), data can be stored within the row(called a page) (if contents are <8000 bytes). If the contents are >8000 bytes, data is stored as a LOB ("off the page"), and only a reference to the actual location is stored within the page. I honestly don't know of any decent way to estimate the size of your entire database, considering the data may be any length in a Varchar(MAX) column.
Related
I’m using SQL server 2016 and I have table in my database and table size is 120 GB. It has 300 columns and all columns are NVARCHAR(MAX) and it has 12,00,000 records in it. Mostly 100 columns are NULL all the time or it will have a short value. Here my doubt is why 12,00,000 records taken 120 GB, is it because of datatype?
This a Audit table. This will have CDC historical information.On average this table will get inserted 10,000 records per day. Because on this, my database size is increasing and SQL queries are slow. This is an Audit table and not used for any queries.
Please let me know the reason why my table is very big.
Of course, it depends on how you are measuring the size of the table and what other operations occur.
You are observing about 10,000 bytes per record. That does seem large, but there are things you need to consider.
NVARCHAR(MAX) has a minimum size:
nvarchar [ ( n | max ) ]
Variable-length Unicode string data. n defines the string length and
can be a value from 1 through 4,000. max indicates that the maximum
storage size is 2^31-1 bytes (2 GB). The storage size, in bytes, is
two times the actual length of data entered + 2 bytes. The ISO
synonyms for nvarchar are national char varying and national character
varying.
Even the empty fields occupy 2 bytes plus the nullable flag. With 300 fields, that is 600-plus bytes right there (600 + 600 / 8).
You may also have issues with pages that are only partially filled. This depends on how you insert data, the primary key, and system parameters.
And there are other considerations, depending on how you are measuring the size:
How large are the largest fields?
How often are rows occupying multiple pages (each additional page has additional overhead)?
You are using wide characters, so they may seem larger than they seem.
Is your estimate including indexes?
If you are measuring database size, you may be including log tables.
I would suggest that you have your DBA investigate the table to see if there are any obvious problems, such as many pages that are only partially filled.
Edit: updated answer upon clarification on the number of rows that the table really have.
Taking into account that 120GB are 120,000MB you are getting 100KB per row, that is about 330 bytes for each column on average, which its usually quite higher but not for a table with 300 nvarchar(max) columns (note that the nchar and nvarchar types take 2 bytes per char, not 1).
Also you commented that one of that columns have a size of 2,000-90,000 characters (!), supposing that column has on average 46k characters we get a size of:
1,200,000 rows x 46k chars x 2 byte/char = 105GB only for the data of that column.
That leaves 15GB for the rest of columns, or about 13KB per row, which is 44 bytes per column, quite low taking into account that almost all are nvarchar(max).
But those are only estimations, for getting the real size of any column use:
select sum(datalength(ColumnName))/1024.00 as SizeKB from TableName
And all of this is only taking into account data, which is not accurate because the database structures needs its size. For example, indexes sum to the total size of a table, roughly they take the sum of the size of the columns included in the index (for example, if you would define and index on the Big Column it would take another 100GB).
You can obtain how many space the whole table uses, using the following script from another question (it will show the size for each table of the DB):
Get size of all tables in database
Check the column UsedSpaceMB, that is the size needed for the data and the indexes, if for some reason the table is using more space (usually because you deleted data) you get that size in UnusedSpaceMB (a bit of unused space is normal).
I have a SQL Server 2008 database that stores millions of rows. There are several NVARCHAR columns that will never exceed the current max length of the column, nor get close to it due to application constraints.
i.e.
The Address NVARCHAR field has a length of 50 characters, but it'll never exceed 32 characters.
Is there a performance benefit or space saving benefit to me reducing the size of the NVARCHAR column to what it's actual max length will be (i.e. in the case of the Address field, 32 characters). Or will it not make a difference since it's a variable length field?
Setting the number of characters in NVARCHAR is mainly for validation purposes. If there is some reason why you don't want the data to exceed 50 characters then the database will enforce that rule for you by not allowing extra data.
If the total row size exceeds a threshold then it can affect performance, so by restricting the length you could benefit by not allowing your row size to exceed that threshold. But in your case, that does not seem to matter.
The reason for this is that SQL Server can fit more rows onto a Page, which results in less disk I/O and more rows can be stored in memory.
Also, the maximum row size in SQL Server is 8KB as that is the size of a page and rows cannot cross page boundaries. If you insert a row that exceeds 8KB, the extra data will be stored in a row overflow page, which will likely have a negative affect on performance.
There is no expected performance or space saving benefit for reducing your n/var/char column definitions to their maximum length. However, there may be other benefits.
The column won't accidentally have a longer value inserted without generating an error (desirable for the "fail fast" characteristic of well-designed systems).
The column communicates to the next developer examining the table something about the data, that aids in understanding. No developer will be confused about the purpose of the data and have to expend wasted time determining if the code's field validation rules are wrong or if the column definition is wrong (as they logically should match).
If your column does need to be extended in length, you can do so with potential consequences ascertained in advance. A professional who is well-versed in databases can use the opportunity to see if upcoming values that will need the new column length will have a negative impact on existing rows or on query performance—as the amount of data per row affects the number of reads required to satisfy queries.
There is a column field1 varchar(32)
If I just store 'ABC' in this field, it takes 3 bytes. When SQL Server access this column and caches it in memory, how much memory does it take? 3 bytes or 32 bytes, and how to prove your answer?
Thanks in advance
SQL Server will not cache the column. It will cache the entire page that happen to contain the value, so it will always cache 8192 bytes. The location of the page is subject to things like in-row vs. row-overflow storage and whether the column is sparse or not.
Now a better question would be how much does such a value occupy in the page? The answer is again not straight forward because the value could be stored uncompressed, row-compressed, page compressed or column-compressed. Is true that row compression has no effect on varchar fields, but page compression does.
Now for a straight forward way of answering how much storage does a varchar(32) type value occupy in-row in a non-compressed table the best resource is Inside the Storage Engine: Anatomy of a record. After you read Paul Randal's article you will be able to answer the question, and also prove the answer.
In addition you must consider any secondary index that has this column as a key or includes this column.
50% of their declared size on average. There is a lot documentation on this. Here is where I pulled this from How are varchar values stored in a SQL Server database?
I have an issue where I have one column in a database that might be anything from 10 to 10,000 bytes in size. Do you know if PostgreSQL supports sparse data (i.e. will it always set aside the 10,000 bytes fore every entry in the column ... or only the space that is required for each entry)?
Postgres will store long varlena types in an extended storage called TOAST.
In the case of strings, it keep things inline up to 126 bytes (potentially means less 126 characters for multibyte stuff), and then sends it to the external storage.
You can see where the data is stored using psql:
\dt+ yourtable
As an aside, note that from Postgres' standpoint, there's absolutely no difference (with respect to storage) between declaring a column's type as varchar or varchar(large_number) -- it'll be stored in the exact same way. There is, however, a very slight performance penalty in using varchar(large_number) because of the string length check.
use varchar or text types - these use only the space actually required to store the data (plus a small overhead of 2 bytes for each value to store the length)
I have a column that is declared as nvarchar(4000) and 4000 is SQL Server's limit on the length of nvarchar. it would not be used as a key to sort rows.
Are there any implication that i should be aware of before setting the length of the nvarchar field to 4000?
Update
I'm storing an XML, an XML Serialized Object to be exact. I know this isn't favorable but we will most likely never need to perform queries on it and this implementation dramatically decreases development time for certain features that we plan on extending. I expect the XML data to be 1500 characters long on average but there can be those exceptions where it can be longer than 4000. Could it be longer than 4000 characters? It could be but in very rare occasions, if it ever happens. Is this application mission critical? Nope, not at all.
SQL Server has three types of storage: in-row, LOB and Row-Overflow, see Table and Index Organization. The in-row storage is fastest to access. LOB and Row-Overflow are similar to each other, both slightly slower than in-row.
If you have a column of NVARCHAR(4000) it will be stored in row if possible, if not it will be stored in the row-overflow storage. Having such a column does not necesarily indicate future performance problems, but it begs the question: why nvarchar(4000)? Is your data likely to be always near 4000 characters long? Can it be 4001, how will your applicaiton handle it in this case? Why not nvarchar(max)? Have you measured performance and found that nvarchar(max) is too slow for you?
My recommendation would be to either use a small nvarchar length, appropiate for the real data, or nvarchar(max) if is expected to be large. nvarchar(4000) smells like unjustified and not tested premature optimisation.
Update
For XML, use the XML data type. It has many advantages over varchar or nvarchar, like the fact that it supports XML indexes, it supports XML methods and can actually validate the XML for a compliance to a specific schema or at least for well-formed XML compliance.
XML will be stored in the LOB storage, outside the row.
Even if the data is not XML, I would still recommend LOB storage (nvarchar(max)) for something of a length of 1500. There is a cost associated with retrieving the LOB stored data, but the cost is more than compensated by macking the table narrower. The width of a table row is a primary factor of performance, because wider tables fit less rows per page, so any operation that has to scan a range of rows or the entire table needs to fetch more pages into memory, and this shows up in the query cost (is actualy the driving factor of the overall cost). A LOB stored column only expands the size of the row with the width of a 'page id', which is 8 bytes if I remember correctly, so you can get much better density of rows per page, hence faster queries.
Are you sure that you'll actually need as many as 4000 characters? If 1000 is the practical upper limit, why not set it to that? Conversely, if you're likely to get more than 4000 bytes, you'll want to look at nvarchar(max).
I like to "encourage" users not use storage space too freely. The more space required to store a given row, the less space you can store per page, which potentially results in more disk I/O when the table is read or written to. Even though only as many bytes of data are stored as are necessary (i.e. not the full 4000 per row), whenever you get a bit more than 2000 characters of nvarchar data, you'll only have one row per page, and performance can really suffer.
This of course assumes you need to store unicode (double-byte) data, and that you only have one such column per row. If you don't, drop down to varchar.
Do you certainly need nvarchar or can you go with varchar? The limitation applies mainly to sql server 2k. Are you using 2k5 / 2k8 ?