Unaccounted for database size - sql-server

I currently have a database that is 20GB in size.
I've run a few scripts which show each table's size (along with other useful information such as index details), and the biggest table is 1.1 million records taking up 150MB. We have fewer than 50 tables, most of which take up less than 1MB of data.
After looking at the size of each table I don't understand why the database isn't closer to 1GB after a shrink. The amount of available free space that SQL Server (2005) reports is 0%. The recovery model is set to simple. At this point my main concern is that I have 19GB of unaccounted-for used space. Is there something else I should look at?
Normally I wouldn't care and would make this a passive research project except this particular situation calls for us to do a backup and restore on a weekly basis to put a copy on a satellite (which has no internet, so it must be done manually). I'd much rather copy 1GB (or even if it were down to 5GB!) than 20GB of data each week.
sp_spaceused reports the following (database name, database size, unallocated space):
Navigator-Production    19184.56 MB    3.02 MB
And the second result set (reserved, data, index size, unused):
19640872 KB    19512112 KB    108184 KB    20576 KB
I've found a few other scripts (such as the ones from the other database-size questions here), and they all report the same information shown above or below.
The script I am using is from SqlTeam. Here is the header info:
* BigTables.sql
* Bill Graziano (SQLTeam.com)
* graz#<email removed>
* v1.11
The top few tables show this (table, rows, reserved space, data, index, unused, etc):
Activity 1143639 131 MB 89 MB 41768 KB 1648 KB 46% 1%
EventAttendance 883261 90 MB 58 MB 32264 KB 328 KB 54% 0%
Person 113437 31 MB 15 MB 15752 KB 912 KB 103% 3%
HouseholdMember 113443 12 MB 6 MB 5224 KB 432 KB 82% 4%
PostalAddress 48870 8 MB 6 MB 2200 KB 280 KB 36% 3%
The rest of the tables are the same size or smaller; there are no more than 50 tables.
Update 1:
- All tables use unique identifiers, usually an int incremented by 1 per row.
I've also re-indexed everything.
I ran the DBCC shrink command, as well as updating the usage before and after, over and over. One interesting thing I found: after I restarted the server and confirmed no one was using it (no maintenance procs run either; this is a very new application, under a week old), the shrink would occasionally complain that data had changed. Googling yielded few useful answers, and the obvious explanation didn't apply (it was 1am and I had disconnected everyone). The data was migrated via C# code that essentially read from another server and brought things over. The number of deleted rows at this point is probably under 50k; even if those were the biggest rows, I can't imagine that accounting for more than 100MB.
When I go to shrink via the GUI it reports 0% available to shrink, indicating that I've already gotten it as small as it thinks it can go.
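For reference, the shrink/updateusage sequence was roughly along these lines (a sketch only; the bracketed database name is the one sp_spaceused reports above):
-- Correct the space accounting first, then try to release unused space back to the OS
DBCC UPDATEUSAGE (0);                          -- 0 = current database
DBCC SHRINKDATABASE ([Navigator-Production]);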
Update 2:
sp_spaceused 'Activity' yields this (which seems right on the money):
Activity    1143639 rows    134488 KB reserved    91072 KB data    41768 KB index    1648 KB unused
Fill factor was 90.
All primary keys are ints.
Here is the command I used to 'updateusage':
DBCC UPDATEUSAGE(0);
Update 3:
Per Edosoft's request:
Image    111975 rows    2407773 total pages    19262184 KB
It appears as though the Image table accounts for the 19GB portion.
I don't understand what this means, though - is it really 19GB, or is it being misrepresented somehow?
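One way to see where that space actually lives (in-row pages vs. LOB pages) is to break the table down by allocation-unit type. A rough sketch, using the documented container_id mapping and the table name reported above:
SELECT au.type_desc,
       SUM(au.total_pages)     AS total_pages,
       SUM(au.total_pages) * 8 AS size_kb
FROM sys.partitions p
INNER JOIN sys.allocation_units au
    ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)      -- IN_ROW_DATA, ROW_OVERFLOW_DATA
    OR (au.type = 2       AND au.container_id = p.partition_id) -- LOB_DATA
WHERE p.object_id = OBJECT_ID('Image')
GROUP BY au.type_desc;
If most of the pages show up as LOB_DATA, the space is being taken by the image/text values themselves rather than by the clustered index.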
Update 4:
Talking to a co-worker, I found out that it's because of the pages, as someone else here also pointed out. The only index on the Image table is a clustered PK. Is this something I can fix, or do I just have to deal with it?
The regular script shows the Image table to be 6MB in size.
Update 5:
I think I'm just going to have to deal with it after further research. The images have been resized to roughly 2-5KB each; on a normal file system they don't consume much space, but on SQL Server they seem to consume considerably more. The real answer, in the long run, will likely be separating that table into another partition or something similar.
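One option worth investigating, if the column is one of the legacy text/ntext/image types (an assumption on my part), is SQL Server 2005's 'text in row' table option, which keeps LOB values up to roughly 7000 bytes on the data page instead of on a dedicated LOB page. A rough sketch:
-- Assumes dbo.Image has a text/ntext/image column; existing values only move
-- to the data page the next time they are updated
EXEC sp_tableoption 'dbo.Image', 'text in row', '7000';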

Try this query:
SELECT OBJECT_NAME(p.object_id) AS name, p.rows, a.total_pages,
       a.total_pages * 8192 / 1024 AS [Size(Kb)]
FROM sys.partitions p
INNER JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
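To rank objects by total allocated space, a grouped variant of the same (simplified) join can be used; a minimal sketch:
SELECT OBJECT_NAME(p.object_id) AS name,
       SUM(a.total_pages) * 8 AS size_kb      -- 8 KB per page
FROM sys.partitions p
INNER JOIN sys.allocation_units a
    ON p.partition_id = a.container_id
GROUP BY p.object_id
ORDER BY size_kb DESC;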

You may also want to update the usage in the system tables before you run the query to make sure the figures are accurate.
DECLARE @DbName NVARCHAR(128);
SET @DbName = DB_NAME(DB_ID());
DBCC UPDATEUSAGE(@DbName);

What fill factor are you using in your reindexing? It needs to be high, around 90-100% depending on the PK datatype.
If your fill factor is low then you'll have a lot of half-empty pages which can't be shrunk down.
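A rebuild with an explicit fill factor would look roughly like this (the table name and the 95% figure are illustrative):
ALTER INDEX ALL ON dbo.Activity REBUILD WITH (FILLFACTOR = 95);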

Did you try the DBCC command to shrink the database? If you transfer all the data to an empty database, is it also 20GB?
A database uses a page-based file system, so you might be running into a lot of slack (empty space within pages) due to heavy row removal: if the DBMS expects rows to be inserted at that spot, it can be better to leave the slots open. Do you use uniqueidentifier-based PKs which have a clustered index?
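If you want to check for that, something along these lines should list clustered indexes whose leading key column is a uniqueidentifier (a sketch against the catalog views):
SELECT OBJECT_NAME(i.object_id) AS table_name,
       c.name                   AS key_column
FROM sys.indexes i
INNER JOIN sys.index_columns ic
    ON ic.object_id = i.object_id AND ic.index_id = i.index_id
INNER JOIN sys.columns c
    ON c.object_id = ic.object_id AND c.column_id = ic.column_id
INNER JOIN sys.types t
    ON t.user_type_id = c.user_type_id
WHERE i.type = 1                 -- clustered indexes only
  AND ic.key_ordinal = 1         -- leading key column
  AND t.name = 'uniqueidentifier';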

You could try doing a database vacuum; this can often yield large space improvements if you have never done it before.
Hope this helps.

Have you checked the stats under the "Shrink Database" dialog? In SQL Server Management Studio (2005 / 2008), right-click the database, click Tasks -> Shrink -> Database. That'll show you how much space is allocated to the DB, and how much of that allocated space is currently unused.
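The same numbers can be pulled with a query; this is roughly what the dialog reports, per file:
SELECT name,
       size * 8 / 1024                            AS allocated_mb,   -- size is in 8 KB pages
       FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024 AS used_mb
FROM sys.database_files;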

Have you ensured that the space isn't being consumed by your transaction log? If you're in full recovery mode, the t-log won't be shrinkable until you perform a transaction log backup.
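To check, and in the full-recovery case described here to clear the log, the usual pattern looks something like this (the backup path is illustrative; the asker's database is in simple recovery, so this mainly applies if that ever changes):
DBCC SQLPERF(LOGSPACE);                                    -- how full each database's log is
BACKUP LOG [Navigator-Production] TO DISK = N'D:\Backups\Navigator-Production.trn';
DBCC SHRINKFILE (2);                                       -- 2 is usually the log file's file_id; confirm in sys.database_files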

Related

SQL Server is creating too many LOB pages

I have a strange situation with a SQL Server database where the actual data in the table is roughly 320 MiB. This is determined by summing the DATALENGTH of all the columns, which ignores fragmentation, index space and other SQL Server internal overhead. The problem is that the table is roughly 40 GiB in size, and it's growing at an alarming rate, very disproportionate to the amount of data (in bytes or rows) being inserted.
I used the sys.dm_db_index_physical_stats function to look at the physical data and the roughly 40 GiB of data is tied up in LOB_DATA.
Most of the 320 MiB that makes up the table contents is of type ntext. My question is: how come SQL Server has allocated 40 GiB of LOB_DATA when there's only roughly 310 MiB of ntext data?
Will the problem go away if we convert the column to nvarchar(max)? Are there any storage-engine specifics regarding ntext and LOB_DATA that cause the LOB_DATA pages to not be reclaimed? Why is it growing at such a disproportionate rate relative to the amount of changes being made?
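For what it's worth, the usual pattern for moving off ntext looks roughly like this (table and column names are hypothetical; the UPDATE forces existing values to be rewritten in the new format, and the REORGANIZE compacts the LOB allocation units, assuming the table has at least one index):
ALTER TABLE dbo.BigTable ALTER COLUMN Notes nvarchar(max) NULL;
UPDATE dbo.BigTable SET Notes = Notes;          -- rewrite existing values in the nvarchar(max) format
ALTER INDEX ALL ON dbo.BigTable REORGANIZE WITH (LOB_COMPACTION = ON);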

How do I efficiently store this large amount of data? A database or something else?

I have to build an application that will check for changes to 35 items each second. Each item has 3 values that fit into 5 bytes each, so 15 bytes per item. The values will not change every second, but there is no pattern; they may change continuously or stall for a while...
So I did a small calculation, and storing all the fields each second in a relational (SQL) database gives:
35 items * 15 bytes * 60 seconds * 60 minutes * 24 hours * 365 days ≈ 16.5 GB a year.
This is too much for an SQL database. What would you do to reduce the size of the data? I was thinking of storing the data only when there is a change, but then you need to store when the change happened, and if the data changes too often this can require more space than the other approach.
I don't know if there are repositories other than SQL databases that fit my requirements better.
What do you think?
EDIT: More information.
There is no relation between the data other than what I could create to save space. I just need to store this data and query it. The data could look like this (putting it all in one table and saving the data each second):
Timestamp Item1A Item1B Item1C Item2A Item2B ....
whatever 1.33 2.33 1.04 12.22 1.22
whatever 1.73 2.33 1.04 12.23 1.32
whatever 1.23 2.33 1.34 12.22 1.22
whatever 1.33 2.31 1.04 12.22 1.21
I feel there must be better solutions than this approach...
EDIT 2:
I will usually query the values of a single item over time; I won't usually query data from more than one item...
This is too much for an SQL database
Since when is it too much?
That's really peanuts for almost any RDBMS out there (~17GB of data every year).
MySQL can do it, so can PostgreSQL, Firebird and plenty of others - though not the likes of SQLite. I'd pick PostgreSQL myself.
Having SQL databases with hundreds of TB of data is not that uncommon these days, so 17GB is nothing to worry about, really. Let alone 170GB in 10 years (on the machines of that time).
Even if it gets to 30GB a year to account for other data and indexes, that's still OK for an SQL database.
Edit
Considering your structure, it looks solid to me: the minimal things you need are already there, and there are no extras you don't need.
You can't do much better than that without resorting to tricks that have more disadvantages than advantages.
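For illustration, the wide-row layout being discussed might look something like this (names and types are illustrative; columns for Item3 through Item35 would follow the same pattern):
CREATE TABLE readings (
    ts      timestamp NOT NULL PRIMARY KEY,  -- one row per second
    item1_a numeric(8,2),
    item1_b numeric(8,2),
    item1_c numeric(8,2),
    item2_a numeric(8,2),
    item2_b numeric(8,2),
    item2_c numeric(8,2)
);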
I'm currently considering using compressed files instead of an SQL database. I will keep the post updated with what I find.

Why is my Firebird database so huge for the amount of data it's holding?

I've been playing around with database programming lately, and I noticed something a little bit alarming.
I took a binary flat file saved in a proprietary, non-compressed format that holds several different types of records, built schemas to represent the same records, and uploaded the data into a Firebird database. The original flat file was about 7 MB. The database is over 70 MB!
I can understand that there's some overhead to describe the tables themselves, and I've got a few minimal indices (mostly PKs) and FKs on various tables, and all that is going to take up some space, but a factor of 10 just seems a little bit ridiculous. Does anyone have any ideas as to what could be bloating up this database so badly, and how I could bring the size down?
From Firebird FAQ:
Many users wonder why they don't get their disk space back when they delete a lot of records from a database.
The reason is that it is an expensive operation; it would require a lot of disk writes and memory, just like defragmenting a hard disk partition. The parts of the database (pages) that were used by that data are marked as empty, and Firebird will reuse them the next time it needs to write new data.
If disk space is critical for you, you can get the space back by doing a backup and then a restore. Since you're doing the backup only to restore it right away, it's wise to use the "inhibit garbage collection" / "don't use garbage collection" switch (-G in gbak), which will make the backup go A LOT FASTER. Garbage collection is used to clean up your database, and as it is a maintenance task it's often done together with backup (since backup has to go through the entire database anyway). However, you're about to ditch that database file, so there's no need to clean it up.
gstat is the tool to examine table sizes etc.; maybe it will give you some hints about what's using the space.
In addition, you may also have multiple snapshots or other garbage in the database file, depending on how you add data to it. The database file never shrinks automatically, but a backup/restore cycle gets rid of junk and empty space.
Firebird fills pages only up to a certain fill factor, not completely full.
For example, a db page may contain 70% data and 30% free space, to speed up future record updates and deletes without moving rows to a new db page. Here is sample gstat output for one table:
CONFIGREVISIONSTORE (213)
Primary pointer page: 572, Index root page: 573
Data pages: 2122, data page slots: 2122, average fill: 82%
Fill distribution:
0 - 19% = 1
20 - 39% = 0
40 - 59% = 0
60 - 79% = 79
80 - 99% = 2042
The same applies to indexes.
You can see the real database size if you do a backup and then restore it with the option
-USE_ALL_SPACE
The database will then be restored without this reserved free space.
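Assuming standard gbak switches (I have not verified the exact invocation on your version), the backup/restore cycle described above would look roughly like this:
gbak -b -g mydb.fdb mydb.fbk             # backup, skipping garbage collection
gbak -c -use_all_space mydb.fbk new.fdb  # restore without the reserved free space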
You should also know that not only pages with data are allocated; some pages are preallocated (empty) for fast future use, to avoid expensive disk allocation and fragmentation.
As "Peter G." says, a database is much more than a flat file and is optimized to speed things up.
And as "Harriv" says, you can get details about the database file with gstat; its documentation describes the output in detail.

How many rows can be executed at one time without getting a timeout?

I work on SQL Server 2005.
I have almost 350,000 insert scripts. Each insert script has 10 columns to be inserted.
So how many rows should I select to be executed at one click ("Execute" click)?
Please tell me an average number for an average system configuration:
Win XP
Core 2 Duo
3.66 GB RAM
OK, let's get some things straight here:
Win XP, Core 2 Duo, 3.66 GB RAM
Not average but outdated. On top of that, it completely omits the most important number for a db, which is the speed and number of discs.
I work on SQL Server 2005. I have almost 350,000 insert scripts.
I seriously doubt you have 350,000 insert SCRIPTS. That would be 350,000 FILES that contain insert commands. That is a lot of files.
The insert script has 10 columns to be inserted.
I order a pizza. How much fuel does my car require per km? Same relation. 10 columns is nice, but you don't say how many insert commands your scripts contain.
So, in the end, the only SENSIBLE interpretation is that you have to insert 350,000 rows and are trying to do it from a program (i.e. there are no scripts to start with), but that is pretty much NOT what you say.
So how many rows should I select to be executed at one click
How many pizzas should I order with one telephone call? The click here is irrelevant. It would also not get faster if you used a command-line program to do the inserts.
The question is how to get the inserts into the db fastest.
For normal SQL:
Batch the inserts: put 50 or 100 into one statement (yes, you can put more than one insert into one command); see the sketch after this list.
Submit them interleaved and asynchronously, preparing the next statement while the previous one executes.
This is very flexible, as you can do real SQL etc.
For real mass inserts:
Forget the idea of writing insert statements. Prepare the data properly as per the table structure and use SqlBulkCopy to mass-insert it.
Less flexible - but a LOT faster.
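For the first approach, a batch might look roughly like this (table and column names are illustrative; SQL Server 2005 has no multi-row VALUES clause, so several single-row INSERTs are simply sent together in one transaction):
BEGIN TRANSACTION;
INSERT INTO dbo.Target (Col1, Col2) VALUES (1, N'first');
INSERT INTO dbo.Target (Col1, Col2) VALUES (2, N'second');
-- ... 50-100 rows per batch ...
COMMIT TRANSACTION;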
The SqlBulkCopy approach, on my SMALL (!) database computer, handles this in about 3-5 seconds when the fields are small (a field can be a 2GB binary blob, you know). I handle about 80,000 row inserts per second without a lot of optimization, though with small and somewhat fewer fields. That is on 4 processor cores (irrelevant, they never get busy), 8GB RAM (VERY small for a database server, also irrelevant in this context), and 6 VelociRaptors for the data in a RAID 10 (again a small configuration for a database, but very relevant here). I see peak inserts in the 150MB-per-second range in the activity monitor. I will do a lot of optimization here, as I currently open/close a db connection every 20,000 items... bad batching.
But then, you don't seem to have a database system at all, just a database installed on a low-end workstation, and that means your IO is going to be REALLY slow compared to a database server, and insert/update speed is IO bound. Desktop discs are slow, and you have data AND logs on the same discs.
But... in the end you don't really tell us anything about your problem.
And... the timeout CAN be set programmatically on the connection object.
I'm pretty sure the timeout can be set by the user under Server Properties -> Connections -> Remote query timeout. If you set this sufficiently high (or to 0, which means it never times out) then you can run as many scripts as you like.
Obviously this is only OK if the database is not yet live and you simply need to populate it. If the data is coming from another MS SQL Server, however, you might just want to take a full backup and restore it - this will be both simpler and quicker.
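For reference, that server-level setting can also be changed in T-SQL; note that it governs remote queries specifically, and 0 means no timeout:
EXEC sp_configure 'remote query timeout', 0;
RECONFIGURE;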
This may be of help.
The general rule of thumb is not to exceed 0.1 seconds per UI operation for excellent responsiveness. You will need to benchmark to find out how many rows that corresponds to on your system.

Why is PostgreSQL eating up all my precious HD space?

I just finished transferring as much link-structure data concerning wikipedia (English) as I could. Basically, I downloaded a bunch of SQL dumps from wikipedia's latest dump repository. Since I am using PostgreSQL instead of MySQL, I decided to load all these dumps into my db using pipeline shell commands.
Anyway, one of these tables has 295 million rows: the pagelinks table; it contains all intra-wiki hyperlinks. From my laptop, using pgAdmin III, I sent the following command to my database server (another computer):
SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);
It's been at it for an hour or so now. The thing is that the postmaster seems to be eating up more and more of my very limited HD space; I think it has eaten about 20 GB so far. I had previously played around with the postgresql.conf file to give it more performance flexibility (i.e. let it use more resources), since the server has 12 GB of RAM. I basically quadrupled most of the byte-related settings in that file, thinking it would use more RAM to do its thing.
However, the db does not seem to be using much RAM. Using the Linux system monitor, I can see that the postmaster is using 1.6 GB of shared memory (RAM). Anyway, I was wondering if you could help me better understand what it is doing, because it seems I really do not understand how PostgreSQL uses HD resources.
Concerning the metastructure of the wikipedia databases, they provide a good schema that may be of use or interest to you.
Feel free to ask me for more details, thx.
It's probably the GROUP BY that's causing the problem. In order to do grouping, the database has to sort the rows to put duplicate items together. An index probably won't help. A back-of-the-envelope calculation:
Assuming each row takes 100 bytes of space, that's 29,500,000,000 bytes, or about 30GB of storage. It can't fit all that in memory, so your system is thrashing, which slows operations down by a factor of 1000 or more. Your HD space may be disappearing into swap space, if it's using swap files.
If you only need to do this calculation once, try breaking it apart into smaller subsets of the data. Assuming pl_namespace is numeric and ranges from 1-295million, try something like this:
SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace between 1 and 50000000
GROUP BY (pl_namespace);
Then do the same for 50000001-100000000 and so forth. Combine your answers together using UNION or simply tabulate the results with an external program. Forget what I wrote about an index not helping GROUP BY; here, an index will help the WHERE clause.
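The index suggested for the WHERE clause would be something like this (the index name is illustrative):
CREATE INDEX idx_pagelinks_namespace ON pagelinks (pl_namespace);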
What exactly is claiming that it's only taking 9.5MB of RAM? That sounds unlikely to me - the shared memory is almost certainly RAM being shared between different Postgres processes. (From what I remember, each client ends up as a separate process, although it's been a while, so I could be very wrong.)
Do you have an index on the pl_namespace column? If there's an awful lot of distinct results, I could imagine that query being pretty heavy on a 295 million row table with no index. Having said that, 10GB is an awful lot to swallow. Do you know which files it's writing to?
OK, so here is the gist of it:
The GROUP BY clause meant the index couldn't be used, so the postmaster (the PostgreSQL server process) decided to create a bunch of temporary tables (23GB of them) located in the directory $PGDATA/base/16384/pgsql_tmp.
When modifying the postgresql.conf file, I had only given PostgreSQL permission to use 1.6 GB of RAM (which I will now double, since it has access to 11.7 GB); the postmaster process was indeed using its 1.6 GB of RAM, but that wasn't enough, hence the pgsql_tmp directory.
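For completeness, the per-operation memory budget that decides when a sort or aggregate spills to pgsql_tmp is work_mem, and it can be raised for a single session before rerunning the query (the value below is illustrative):
SET work_mem = '256MB';   -- per sort/hash operation; when exceeded, PostgreSQL spills to pgsql_tmp
SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY pl_namespace;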
As was pointed out by Barry Brown, since I was only executing this SQL command to get some statistical information about the distribution of the links among the pagelinks.namespaces, I could have queried a subset of the 296 million pagelinks (this is what they do for surveys).
When the command returned the result set, all temporary tables were automatically deleted as if nothing had happened.
Thx for your help guys!
