Table size has grown to 120 GB in SQL Server - sql-server

I’m using SQL Server 2016 and I have a table in my database whose size is 120 GB. It has 300 columns, all of type NVARCHAR(MAX), and it holds 1,200,000 records. About 100 of the columns are NULL all the time or hold only a short value. My question is: why do 1,200,000 records take 120 GB? Is it because of the datatype?
This is an audit table that holds CDC historical information. On average about 10,000 records are inserted into it per day. Because of this, my database size is increasing and SQL queries are slow. Being an audit table, it is not used for any queries.
Please let me know the reason why my table is very big.

Of course, it depends on how you are measuring the size of the table and what other operations occur.
You are observing about 100,000 bytes per record. That does seem large, but there are things you need to consider.
NVARCHAR(MAX) has a minimum size:
nvarchar [ ( n | max ) ]
Variable-length Unicode string data. n defines the string length and can be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes (2 GB). The storage size, in bytes, is two times the actual length of data entered + 2 bytes. The ISO synonyms for nvarchar are national char varying and national character varying.
Even the empty fields occupy 2 bytes each, plus a bit in the NULL bitmap. With 300 fields, that is 600-plus bytes right there (600 + 300 / 8).
You may also have issues with pages that are only partially filled. This depends on how you insert data, the primary key, and system parameters.
And there are other considerations, depending on how you are measuring the size:
How large are the largest fields?
How often are rows occupying multiple pages (each additional page has additional overhead)?
You are using wide (2-bytes-per-character) strings, so the data may be larger than it seems.
Is your estimate including indexes?
If you are measuring database size, you may be including log tables.
I would suggest that you have your DBA investigate the table to see if there are any obvious problems, such as many pages that are only partially filled.
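If it helps, here is a rough sketch of the kind of check a DBA might run (the table name dbo.AuditTable is a placeholder, not from the question): sys.dm_db_index_physical_stats reports how full the pages are and how large the rows actually are.
-- Placeholder table name; DETAILED mode reads every page, so run it off-hours on a 120 GB table.
SELECT index_id,
       index_level,
       page_count,
       avg_page_space_used_in_percent,   -- low values = many partially filled pages
       avg_record_size_in_bytes,
       forwarded_record_count            -- heaps only: rows that have moved off their original page
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.AuditTable'), NULL, NULL, 'DETAILED');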

Edit: updated answer upon clarification on the number of rows that the table really has.
Taking into account that 120GB is 120,000MB, you are getting 100KB per row, which is about 330 bytes per column on average. That would usually be quite high, but not for a table with 300 nvarchar(max) columns (note that the nchar and nvarchar types take 2 bytes per char, not 1).
Also, you commented that one of those columns holds 2,000-90,000 characters (!). Supposing that column averages 46k characters, we get a size of:
1,200,000 rows x 46k chars x 2 byte/char = 105GB only for the data of that column.
That leaves 15GB for the rest of the columns, or about 13KB per row, which is around 44 bytes per column: quite low considering that almost all of them are nvarchar(max).
But those are only estimates. To get the real size of any column, use:
select sum(datalength(ColumnName))/1024.00 as SizeKB from TableName
And all of this only takes the data into account, which is not the whole picture, because the database structures need space too. For example, indexes add to the total size of a table; roughly, they take the sum of the sizes of the columns included in the index (for example, if you defined an index on that big column, it would take another ~100GB).
You can find out how much space the whole table uses with the following script from another question (it shows the size of every table in the DB):
Get size of all tables in database
Check the column UsedSpaceMB; that is the space needed for the data and the indexes. If for some reason the table is using more space (usually because you deleted data), you will see it in UnusedSpaceMB (a bit of unused space is normal).
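For reference, a query along the lines of the linked script looks roughly like this (a sketch of the usual sys.allocation_units approach, not the exact script from that question):
-- Per-table space report; each page is 8 KB, so pages * 8 / 1024 gives MB.
SELECT t.name                                                 AS TableName,
       SUM(a.total_pages) * 8 / 1024.0                        AS TotalSpaceMB,
       SUM(a.used_pages)  * 8 / 1024.0                        AS UsedSpaceMB,
       (SUM(a.total_pages) - SUM(a.used_pages)) * 8 / 1024.0  AS UnusedSpaceMB
FROM sys.tables t
JOIN sys.indexes i          ON i.object_id = t.object_id
JOIN sys.partitions p       ON p.object_id = i.object_id AND p.index_id = i.index_id
JOIN sys.allocation_units a ON a.container_id = p.partition_id
WHERE t.is_ms_shipped = 0
GROUP BY t.name
ORDER BY UsedSpaceMB DESC;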

Related

Why is VARCHAR slower than CHAR on updating rows?

I was reading a book:
For example, when a column is defined as VARCHAR(25), the maximum number of characters supported is 25, but in practice, the actual number of characters in the string determines the amount of storage. Because storage consumption for these data types is less than that for fixed-length types, read operations are faster. However, updates might result in row expansion, which might result in data movement outside the current page. Therefore, updates of data having variable-length data types are less efficient than updates of data having fixed-length data types.
I can understand that storage consumption for varchar is less than that for char, but why is it slower than char when updating records? What does row expansion mean, and what actually happens when a row expands?
Let's say we have a suburb table with two columns, zipcode char(5) and name varchar, and we need to update a row so that zipcode is 10005 and name is 'NYC'. We only set 3 characters for the name column, so shouldn't that be more efficient than the zipcode column, which requires 5 characters?
Rows are laid out with the fixed size columns first, at fixed offsets from the start of the row. Then (after some important bytes in the middle) the variable sized data is placed at the end. Because it's variable sized, the actual offset to the data cannot be computed for the whole table (like the fixed data) but has to be computed on a row-by-row basis.
And if a varchar(5)¹ is storing NYC and is then asked to store NYCX, it may find that there's not a spare byte at the end of NYC - it's being used for another column - so the row has to expand by moving everything after it one byte further along to make space for the extra byte.
¹ I notice in one of your examples you failed to specify a length. Please drill it into yourself that that's a bad habit.
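As an illustration of the row-expansion effect described above, here is a small hypothetical T-SQL sketch (the table and names are made up): it grows a variable-length value and reads the record size before and after.
-- Hypothetical demo table for illustration only.
CREATE TABLE dbo.Suburb (
    zipcode CHAR(5)     NOT NULL,
    name    VARCHAR(50) NOT NULL
);
INSERT INTO dbo.Suburb (zipcode, name) VALUES ('10005', 'NYC');

-- Record size before the update (DETAILED mode actually reads the pages).
SELECT max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Suburb'), NULL, NULL, 'DETAILED');

-- Growing the variable-length column forces the row to expand.
UPDATE dbo.Suburb SET name = 'New York City' WHERE zipcode = '10005';

-- Record size after the update: larger, because the varchar now holds more bytes.
SELECT max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Suburb'), NULL, NULL, 'DETAILED');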

Netezza table size increased after using CTAS command

I have a large table in Netezza and table size is approx 600 GB.
When I created a new table from my existing table, the table size increased. The new table's size is 617 GB.
SQL which I used to create new table:
create table new_table_name as select * from old_table_name distribute on (column_name);
generate statistics on new_table_name;
However, the row counts for the new table and the old table are the same.
What could be the reason for increasing table size?
Thanks in advance.
There are two relevant measurements for 'size' of a table: allocated and used size (both in bytes)
_v_table_storage_stat will help you look at both sizes for a given table
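A minimal sketch of querying that view follows; the column names here (tablename, used_bytes, allocated_bytes) are assumptions from memory, so check the view definition on your appliance if they do not match.
-- Column names are assumed; verify them against _v_table_storage_stat on your system.
SELECT tablename, used_bytes, allocated_bytes
FROM _v_table_storage_stat
WHERE tablename = 'NEW_TABLE_NAME';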
For small tables, the allocated size can be many times larger than the used size; assuming an even distribution of rows, a minimum of 3MB will be allocated on each data slice. I do most of my work on a double-rack MAKO system with 480 data slices, so any table smaller than about 1.4GB (480 x 3MB) is more or less irrelevant for optimization of 'size'.
Nevertheless I'll try to explain what you see:
You must realize that
1) all data in Netezza are compressed.
2) Compression is being done for a 'block' of data on each individual dataslice.
3) Compression ratio (the size of data after compression divided by the size before) gets better (smaller) if data in each block share many similarities compared to the most 'mixed' situation imaginable.
4) 'distribute on' and 'organize on' can both affect this. So can an 'order by' or even a 'group by' in the select statement used when adding data to your table.
In my system, I have a very wide table with several 'copies' per day of the bank accounts of our customers. Each copy is 99% identical to the previous one; only things like 'balance' change.
By distributing on AccountID and organizing on AccountID, Timestamp, I saw a 10-15% smaller size. Some data slices saw a bigger effect, because they contained a lot of 'system' account IDs, which have a different pattern in the data.
In short:
A) it's perfectly natural
B) don't worry too much about it since:
C) a 'large' table on a Netezza system is not the same as on a 4-core database with too little memory and sloooow disks :)

SQL Server Maximum Row size Vs Varchar(Max) size

I am trying to estimate database size for SQL Server 2008 R2. I have a table with one INTEGER primary key and 39 text columns of type VARCHAR(MAX).
I have searched and found two statements.
A table can contain a maximum of 8,060 bytes per row.
Varchar(max) has a maximum storage capacity of 2 gigabytes.
I am confused about how to estimate the size. How can I store 2 gigabytes in each column if there is a limit on row size?
I am not a database expert, so maybe I am not getting it correctly.
Can anyone explain How to estimate it?
Thank you
In Microsoft SQL Server, data (which includes indexes) is stored in one or more 8k (8192-byte) "pages". There are different types of pages that can be used to handle various situations (e.g. Data, LOB, Index, AllocationMap, etc.). Each page has a header, which is metadata about that page and what it contains.
Most data is stored in the row itself, and one or more of these rows are in turn stored in a page for "in-row data". Due to the space taken by the row header, the largest a row can be (for "in-row" data) is 8060 bytes.
However, not all data is stored in the row. For certain datatypes, the data can actually be stored on a "LOB data" page while a pointer is left in the "in-row" data:
Legacy / deprecated LOB types that nobody should be using anymore (TEXT, NTEXT, and IMAGE), by default, always store their data on LOB pages and always use a 16-byte pointer to that LOB page.
The newer LOB types (VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX), and XML), by default, will attempt to fit the data directly in the row if it will fit. Else it will store the data on LOB pages and use a pointer of 24 - 72 bytes (depending on the size of the LOB data).
This is how you could store up to 78 GB + 4 bytes (can't forget about the INT Primary Key ;-) in a single row: the max row size will be between 940 bytes ((39 * 24) + 4) and 2812 bytes ((39 * 72) + 4). But again, that is just the maximum range; if the data in each of the 39 VARCHAR(MAX) fields is just 10 bytes, then all of the data will be stored in-row and the row size will be 394 bytes ((39 * 10) + 4).
Given that you have so many variable-length fields (whether they are MAX or not), the only way to estimate the size of future rows is to have a good idea about what data you will be storing in this table. Although, a table with all, or even mostly, MAX datatypes implies that nobody really has any idea what is going to be stored in this table.
Along those lines, it should be pointed out that this is a horribly modeled table / horrible use of MAX datatype fields, and should be refactored.
For more details about how data pages are structured, please see my answer to the following DBA.StackExchange question:
SUM of DATALENGTHs not matching table size from sys.allocation_units
When you use Varchar(MAX), the data can be stored in-row, on the data page, if the contents are under 8,000 bytes. If the contents are larger, the data is stored off the page as a LOB, and only a pointer to the actual location is stored within the row. I honestly don't know of any decent way to estimate the size of your entire database, considering the data may be any length in a Varchar(MAX) column.
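If you want to see how a table's space actually splits between in-row and LOB storage, a sketch along these lines works (the table name dbo.YourTable is a placeholder):
-- Placeholder table name; shows used space per allocation unit type.
SELECT au.type_desc,                              -- IN_ROW_DATA, LOB_DATA, ROW_OVERFLOW_DATA
       SUM(au.used_pages) * 8 / 1024.0 AS UsedSpaceMB
FROM sys.partitions p
JOIN sys.allocation_units au
     ON (au.type = 2 AND au.container_id = p.partition_id)    -- LOB data
     OR (au.type IN (1, 3) AND au.container_id = p.hobt_id)   -- in-row / row-overflow data
WHERE p.object_id = OBJECT_ID('dbo.YourTable')
GROUP BY au.type_desc;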

Is there a benefit to decreasing the size of my NVARCHAR columns

I have a SQL Server 2008 database that stores millions of rows. There are several NVARCHAR columns that will never exceed the current max length of the column, nor get close to it due to application constraints.
i.e.
The Address NVARCHAR field has a length of 50 characters, but it'll never exceed 32 characters.
Is there a performance or space-saving benefit to reducing the size of the NVARCHAR column to what its actual max length will be (i.e. in the case of the Address field, 32 characters)? Or will it not make a difference, since it's a variable-length field?
Setting the number of characters in NVARCHAR is mainly for validation purposes. If there is some reason why you don't want the data to exceed 50 characters then the database will enforce that rule for you by not allowing extra data.
If the total row size exceeds a threshold then it can affect performance, so by restricting the length you could benefit by not allowing your row size to exceed that threshold. But in your case, that does not seem to matter.
The reason keeping rows small helps is that SQL Server can fit more rows onto a page, which results in less disk I/O, and more rows can be held in memory.
Also, the maximum row size in SQL Server is 8KB, as that is the size of a page and rows cannot cross page boundaries. If you insert a row that exceeds 8KB, the extra data will be stored in a row-overflow page, which will likely have a negative effect on performance.
There is no expected performance or space-saving benefit from reducing your n/var/char column definitions to their actual maximum length. However, there may be other benefits.
The column won't accidentally have a longer value inserted without generating an error (desirable for the "fail fast" characteristic of well-designed systems).
The column communicates to the next developer examining the table something about the data, that aids in understanding. No developer will be confused about the purpose of the data and have to expend wasted time determining if the code's field validation rules are wrong or if the column definition is wrong (as they logically should match).
If your column does need to be extended in length, you can do so with potential consequences ascertained in advance. A professional who is well-versed in databases can use the opportunity to see if upcoming values that will need the new column length will have a negative impact on existing rows or on query performance—as the amount of data per row affects the number of reads required to satisfy queries.
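To make the "no space saving" point concrete, here is a tiny hypothetical demo (table names and the value are made up): the same string occupies the same number of bytes whether the column is declared NVARCHAR(50) or NVARCHAR(32).
-- Hypothetical tables for illustration only.
CREATE TABLE dbo.Addr50 (Address NVARCHAR(50));
CREATE TABLE dbo.Addr32 (Address NVARCHAR(32));

INSERT INTO dbo.Addr50 (Address) VALUES (N'221B Baker Street');
INSERT INTO dbo.Addr32 (Address) VALUES (N'221B Baker Street');

-- Both return 34 bytes (17 characters x 2 bytes per character), regardless of declared length.
SELECT DATALENGTH(Address) AS Bytes50 FROM dbo.Addr50;
SELECT DATALENGTH(Address) AS Bytes32 FROM dbo.Addr32;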

Statistical analysis for performance measurement on usage of bigger datatypes where they were not required at all

If I use a larger datatype where I know a smaller datatype would have been sufficient for the possible values I will insert into a table, will it affect performance in SQL Server in terms of speed or in any other way?
e.g.
IsActive can be 0, 1, 2 or 3 (never more than 3 in any case).
I know I should use tinyint, but for certain reasons (consider it a compulsion) I am making every numeric field bigint and every character field nvarchar(max).
Please give statistics if possible, to help me try to overcome that compulsion.
I need some solid analysis that can really make someone rethink before choosing a datatype.
EDIT
Say, I am using
SELECT * FROM tblXYZ WHERE IsActive = 1
How will it be affected? Consider that I have 1 million records.
Will it only waste memory, or will it hurt performance as well?
I know that the more pages there are, the more indexing effort is required, and hence performance will also be affected. But I need some statistics if possible.
You are basically wasting 7 bytes per row for each bigint; this makes your tables bigger, so fewer rows fit per page and more IO is needed to bring back the same number of rows than if you had used tinyint. If you have a billion-row table, it adds up.
Defining this in statistical terms is somewhat difficult, but you can literally do the maths and work out the additional IO overhead.
Let's take a table with 1 million rows, assume no page padding or compression, and use some simple figures.
Given a table whose row size is 100 bytes and which contains 10 tinyints, the number of rows per page (assuming no padding / fragmentation) is 80 (8096 / 100).
By using Bigints, a total of 70 bytes would be added to the row size (10 fields that are 7 bytes more each), giving a row size of 170 bytes, and reducing the rows per page to 47.
For the 1 million rows this results in 12,500 pages for the tinyints, and 21,277 pages for the bigints.
Taking a single disk, reading sequentially, we might expect 300 IOs per second sequential reading, and each read is 8k (e.g. a page).
The respective read times given this theoretical disk are then 41.6 seconds and 70.9 seconds, for a very theoretical scenario of a made-up table / row.
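For anyone who wants to replay that arithmetic, here it is as a throwaway query (pure calculation, no tables involved):
-- Rows per page, pages per million rows, and scan seconds at 300 sequential 8k reads/second.
SELECT 8096 / 100             AS TinyintRowsPerPage,   -- 80
       8096 / 170             AS BigintRowsPerPage,    -- 47
       1000000 / 80           AS TinyintPages,         -- 12,500
       1000000 / 47           AS BigintPages,          -- ~21,277
       (1000000 / 80) / 300.0 AS TinyintScanSeconds,   -- ~41.7
       (1000000 / 47) / 300.0 AS BigintScanSeconds;    -- ~70.9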
That however only applies to a scan; under an index seek, the increase in IO would be relatively small, depending on how many of the bigints were in the index or clustered key. In terms of backup and restore, as mentioned, the data is expanded out and the time loss can be calculated as linear unless compression is at play.
In terms of memory caching, each byte wasted on a page on disk is a byte wasted in memory, but only for the pages that are in memory. This is where it gets more complex, since the memory wastage depends on how many of the pages are sitting in the buffer pool; but for the above example it would be broadly 97.6MB of data vs 166MB of data, and assuming the entire table was scanned and thus in the buffer pool, you would be wasting roughly 68MB of memory.
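If you would rather measure that than estimate it, a sketch like the following (using the question's tblXYZ as a placeholder name) shows how many of a table's pages are currently in the buffer pool:
-- Placeholder table name; counts this table's 8 KB pages currently cached in memory.
SELECT COUNT(*)              AS PagesInBufferPool,
       COUNT(*) * 8 / 1024.0 AS BufferPoolMB
FROM sys.dm_os_buffer_descriptors bd
JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions p        ON p.hobt_id = au.container_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID('dbo.tblXYZ');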
A lot of it comes down to space. Your bigints are going to take 8 times the space (8 bytes vs 1 byte for tinyint). Your nvarchar is going to take twice as many bytes as a varchar. Making it max won't affect much of anything.
This will really come into play if you're doing look ups on values. The indexes you will (hopefully) be applying will be much larger.
I'd at least pare it down to int. Bigint is way overkill. But something about this field is calling out to me that something else is wrong with the table as well. Maybe it's just the column name — IsActive sounds like it should be a boolean/bit column.
More than that, though, I'm concerned about your varchar(max) fields. Those will add up even faster.
All the 'wasted' space also comes into play for DR (if you are 4-6 times the size due to poor data type configuration, your recovery can be just as long).
Not only do the larger pages/extents require more IO to serve; you also reduce the effectiveness of your memory cache in proportion to the size. With billions of rows, depending on your server, you could be dealing with constant memory pressure and cache evictions simply because you chose a datatype that was 8 times the size you needed it to be.
