I've been trying to estimate the size of an Access table with a certain number of records.
It has 4 Longs (4 bytes each), and a Currency (8 bytes).
In theory: 1 record = 24 bytes, so 500,000 records ≈ 11.5MB.
However, the accdb file (even after compacting) increases by almost 30MB (~61 bytes per record). A few extra bytes of padding per record wouldn't be so bad, but 2.5X seems a bit excessive, even for Microsoft bloat.
What's with the discrepancy? The four Longs form a compound key; would that matter?
These are the results of my tests, all conducted with an A2003 MDB, not an A2007 ACCDB:
98,304 IndexTestEmpty.mdb
131,072 IndexTestNoIndexesNoData.mdb
11,223,040 IndexTestNoIndexes.mdb
15,425,536 IndexTestPK.mdb
19,644,416 IndexTestPKIndexes1.mdb
23,838,720 IndexTestPKIndexes2.mdb
24,424,448 IndexTestPKCompound.mdb
28,041,216 IndexTestPKIndexes3.mdb
28,655,616 IndexTestPKCompoundIndexes1.mdb
32,849,920 IndexTestPKCompoundIndexes2.mdb
37,040,128 IndexTestPKCompoundIndexes3.mdb
The names should be pretty self-explanatory, I think. I used an append query with Rnd() to append 524,288 records of fake data, which made the file 11MBs. The indexes I created on the other fields were all non-unique. But as you can see, the compound 4-column index increased the size from 11MBs (no indexes) to well over 24MBs. A PK on the first column alone increased the size only from 11MBs to 15.4MBs (using fake MBs, of course, i.e., like hard drive manufacturers).
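For reference, the append query was along these lines (just a sketch; the table and field names here are made up). Seed the table with one row of random values, then run the query repeatedly; each run doubles the row count, so 19 runs reaches 524,288. Passing a field to Rnd() forces it to be re-evaluated for each row:

    INSERT INTO tblIndexTest ( Long1, Long2, Long3, Long4 )
    SELECT CLng(Rnd([Long1]) * 2147483646),
           CLng(Rnd([Long2]) * 2147483646),
           CLng(Rnd([Long3]) * 2147483646),
           CLng(Rnd([Long4]) * 2147483646)
    FROM tblIndexTest;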
Notice how each single-column index added approximately 4MBs to the file size. If you consider that 4 columns with no indexes totalled 11MBs, that seems about right based on my comment above, i.e., that each index should increase the file size by about the amount of data in the field being indexed. I am surprised that the clustered index did this, too -- I thought that the clustered index would use less space, but it doesn't.
For comparison, a non-PK (i.e., non-clustered) unique index on the first column, starting from IndexTestNoIndexes.mdb, produces a file exactly the same size as the database with the first column as the PK, so there's no space savings from the clustered index at all. On the off chance that the ordinal position of the indexed field might make a difference, I also tried a unique index on the second column only, and it came out exactly the same size.
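If it helps to picture the two variants being compared, they amount to something like this DDL (hypothetical names; just a sketch):

    CREATE TABLE tblIndexTest
        (Long1 LONG, Long2 LONG, Long3 LONG, Long4 LONG,
         CONSTRAINT pkCompound PRIMARY KEY (Long1, Long2, Long3, Long4));

    CREATE UNIQUE INDEX ixLong1 ON tblIndexTest (Long1);

The first statement is the compound-PK variant; for the single-index variant, create the table without the constraint and run only the second statement.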
Now, I didn't read your question carefully, and omitted the Currency field, but if I add that to the non-indexed table and the table with the compound index and populate it with random data, I get this:
98,304 IndexTestEmpty.mdb
131,072 IndexTestNoIndexesNoData.mdb
11,223,040 IndexTestNoIndexes.mdb
15,425,536 IndexTestPK.mdb
15,425,536 IndexTestIndexUnique2.mdb
15,425,536 IndexTestIndexUnique1.mdb
15,482,880 IndexTestNoIndexes+Currency.mdb
19,644,416 IndexTestPKIndexes1.mdb
23,838,720 IndexTestPKIndexes2.mdb
24,424,448 IndexTestPKCompound.mdb
28,041,216 IndexTestPKIndexes3.mdb
28,655,616 IndexTestPKCompoundIndexes1.mdb
28,692,480 IndexTestPKCompound+Currency.mdb
32,849,920 IndexTestPKCompoundIndexes2.mdb
37,040,128 IndexTestPKCompoundIndexes3.mdb
The points of comparison are:
11,223,040 IndexTestNoIndexes.mdb
15,482,880 IndexTestNoIndexes+Currency.mdb
24,424,448 IndexTestPKCompound.mdb
28,692,480 IndexTestPKCompound+Currency.mdb
So, the currency field added another 4.5MBs, and its index added another 4MBs. And if I add non-unique indexes to the 2nd, 3rd and 4th Long fields, the database grows to 41,336,832 bytes, an increase of just under 12MBs (or ~4MBs per additional index).
So, this basically replicates your results, no? And I ended up with the same file sizes, roughly speaking.
The answer to your question is INDEXES, though there is obviously more overhead in the A2007 ACCDB format, since I saw an increase in size of only 20MBs, not 30MBs.
One thing I did notice: I could add an index that made the file larger, then delete the index and compact, and the file returned to exactly the same size as before. So you should be able to take a single copy of your database and experiment with what removing the indexes does to your file size.
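Dropping an index is a one-liner in Access DDL (hypothetical names below), and the compact can be done from the UI or with DBEngine.CompactDatabase in code:

    DROP INDEX ixLong2 ON tblIndexTest;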
I am getting the error below. Why am I getting it, and how do I fix it?
Operation failed. The index entry of length 1526 bytes for the index ix_Emp_no_1 exceeds the maximum length of 900 bytes.
Maximum bytes per index key:
900 bytes for a clustered index.
1,700 bytes for a nonclustered index.
Result and meaning:
An nchar(450) comes to exactly 900 bytes, because Unicode characters occupy 2 bytes each.
You can skip Unicode and use char(900) instead.
Variable-length string types like varchar and nvarchar appear to work the same way.
If you have a composite key containing multiple columns, the sum of their sizes has to stay under these limits.
Note: sometimes the SSMS designer won't let you create an index key larger than 900 bytes (up to the 1,700-byte limit) even though it is officially supported; you just need to generate the script and run it yourself.
Source: https://learn.microsoft.com/en-us/sql/sql-server/maximum-capacity-specifications-for-sql-server?view=sql-server-2017
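Here is a hedged T-SQL sketch of how the limit shows up (the table and index names are made up to mirror the error above): the index on a wide variable-length column is created with only a warning, and the error is raised when a row whose key exceeds 900 bytes is inserted.

    CREATE TABLE dbo.Emp (Emp_no nvarchar(500) NOT NULL);

    -- Created with a warning that the maximum key length for a clustered index is 900 bytes
    CREATE CLUSTERED INDEX ix_Emp_no_1 ON dbo.Emp (Emp_no);

    -- Fails: 763 Unicode characters = 1526 bytes, which exceeds the 900-byte limit
    INSERT INTO dbo.Emp (Emp_no) VALUES (REPLICATE(N'x', 763));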
I have never seen this, but I found this article that may help you deal with your situation.
https://blogs.msdn.microsoft.com/bartd/2011/01/06/living-with-sqls-900-byte-index-key-length-limit/
It sounds like you have a key that is over 900 bytes, a large text field maybe. So you need to find a different way to identify that row. Try to index a different column and include the problem column.
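Something along these lines (hypothetical column names) keys the index on a narrower column and carries the wide one as a non-key INCLUDE column, which is not counted against the key-size limit:

    CREATE NONCLUSTERED INDEX ix_Emp_lookup
    ON dbo.Emp (Emp_name)
    INCLUDE (Emp_no);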
I have some data which I will be putting in the database. Say I create a field like "coupondetail text(10000)" to store the coupon detail, but not every coupondetail will be 10,000 chars long. I'm curious how much space the column will take in the database when the text is shorter than 10,000 chars, say 1,000.
SQLite does not care much how you declare your column types and ignores any maximum length specified. The declared type is just a hint; any column other than an INTEGER PRIMARY KEY can hold values of any type.
The size taken up in the database file depends on the values you put in. In the record format, strings are stored as a length followed by the string data; no empty space is reserved.
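A quick way to see this from the sqlite3 shell (table name made up):

    CREATE TABLE coupon (coupondetail TEXT(10000));  -- the (10000) is ignored
    INSERT INTO coupon VALUES ('roughly a thousand characters of detail...');
    SELECT length(coupondetail) FROM coupon;         -- reports only what was actually stored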
I have a set of data which has a name, some sub-values, and then an associated numeric value. For example:
James Value1 Value2 "1.232323/1.232334"
Jim Value1 Value2 "1.245454/1.232999"
Dave Value1 Value2 "1.267623/1.277777"
There will be around 100,000 entries like this stored in either a file or a database. I would like to know the quickest way to return the results that match a search, along with their associated numeric values.
For example, a query of "J" would return both the James and Jim results, with the numeric values in the last column.
I've heard people mention binary tree searching, dictionary searching, and indexed searching. I have no idea which is a good route to pursue.
This is a poorly characterized problem. As with many optimization problems, there are trade-offs in resources. If you truly want the fastest response possible, then a likely approach is to compile all possible searches into a table of prepared results, so that, given a search key, you can look the search key up in the table and return the result.
Assuming your character set is limited to A-Z and a-z, a table with an entry for each search key from 0 to 4 characters will use a modest amount of memory by today’s standards. Each table entry merely needs to have two values in it: The start and end positions in a list of the numeric values. (Compile the list in this way: Sort the records by the name field. Extract just the numeric values from the records, maintaining the order, putting them in a list. Any search key must return a sublist of contiguous records from that list. This is because the search is for a prefix string of the name field, so any records that match the search key are adjacent, when sorted by the name field.)
Thus, to create a table to look up any key of 0 to 4 characters, you need fewer than 53⁴ entries in a table of pairs, where each member of the pair contains a record number (32 bits or fewer). So 8•53⁴ bytes ≈ 60.2 MiB suffices. (53 is because you have 52 characters plus one sentinel character to mark the end of the key. Alternate encodings could reduce this some.)
To support keys of more than 4 characters, you need to extend this. With typical data, 4 characters will have narrowed down the search greatly, so you can take the set of records indicated by the first 4 characters and prune it to get the final results. If the data has pathological cases where 4 characters does not reduce the search much, you can embellish this technique.
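If the data does end up in a SQL database rather than a file, the contiguous-range idea can be sketched there too (hypothetical table people(name, val)): once the rows are numbered in name order, each prefix maps to a single start/end pair, and longer prefixes are built the same way.

    WITH sorted AS (
        SELECT name, val, ROW_NUMBER() OVER (ORDER BY name) AS rn
        FROM people
    )
    SELECT LEFT(name, 1) AS prefix,   -- repeat with LEFT(name, 2), 3, 4 for longer keys
           MIN(rn) AS start_rn,
           MAX(rn) AS end_rn
    FROM sorted
    GROUP BY LEFT(name, 1);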
So, is that really what you want to do, make the speed as fast as possible, regardless of other resources (including engineering time) consumed? If not, what are your actual goals?
I have a large table with, say, 10 columns, 4 of which remain null most of the time. My question is: does a null value take up any size in bytes, or none? I have read a few articles, and some of them say:
http://www.sql-server-citation.com/2009/12/common-mistakes-in-sql-server-part-4.html
There is a misconception that if we have the NULL values in a table it doesn't occupy storage space. The fact is, a NULL value occupies space – 2 bytes
SQL: Using NULL values vs. default values
A NULL value in databases is a system value that takes up one byte of storage and indicates that a value is not present as opposed to a space or zero or any other default value.
Can you please clarify how much space a NULL value actually takes?
If the field is fixed width storing NULL takes the same space as any other value - the width of the field.
If the field is variable width the NULL value takes up no space.
In addition to the space required to store a null value there is also an overhead for having a nullable column. For each row one bit is used per nullable column to mark whether the value for that column is null or not. This is true whether the column is fixed or variable length.
The reason for the discrepancies that you have observed in information from other sources:
The start of the first article is a bit misleading. The article is not talking about the cost of storing a NULL value, but the cost of having the ability to store a NULL (i.e the cost of making a column nullable). It's true that it costs something in storage space to make a column nullable, but once you have done that it takes less space to store a NULL than it takes to store a value (for variable width columns).
The second link seems to be a question about Microsoft Access. I don't know the details of how Access stores NULLs but I wouldn't be surprised if it is different to SQL Server.
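If you want to check this on your own server, here is a hedged sketch (throwaway table names) that compares the average physical record size of a fixed-width versus a variable-width nullable column, both holding only NULLs:

    CREATE TABLE dbo.NullFixed (id int IDENTITY PRIMARY KEY, c char(100) NULL);
    CREATE TABLE dbo.NullVar   (id int IDENTITY PRIMARY KEY, c varchar(100) NULL);

    INSERT INTO dbo.NullFixed (c) VALUES (NULL);
    INSERT INTO dbo.NullVar   (c) VALUES (NULL);

    -- The fixed-width table's rows should come out roughly 100 bytes larger
    SELECT OBJECT_NAME(object_id) AS table_name, avg_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')
    WHERE object_id IN (OBJECT_ID('dbo.NullFixed'), OBJECT_ID('dbo.NullVar'));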
The following link claims that if the column is variable length, i.e. varchar then NULL takes 0 bytes (plus 1 byte is used to flag whether value is NULL or not):
How does SQL Server really store NULL-s
The above link, as well as the below link, claim that for fixed length columns, i.e. char(10) or int, a value of NULL occupies the length of the column (plus 1 byte to flag whether it's NULL or not):
Data Type Performance Tuning Tips for Microsoft SQL Server
Examples:
If you set a char(10) to NULL, it occupies 10 bytes (zeroed out)
An int takes 4 bytes (also zeroed out).
A varchar(1 million) set to NULL takes 0 bytes (+ 2 bytes)
Note: on a slight tangent, the storage size of varchar is the length of data entered + 2 bytes.
From this link:
Each row has a null bitmap for columns that allow nulls. If the row in that column is null then a bit in the bitmap is 1, else it's 0.
For variable-size datatypes the actual size is 0 bytes.
For fixed-size datatypes the actual size is the default datatype size in bytes, set to the default value (0 for numbers, '' for chars).
Storing a NULL value does not take any space.
"The fact is, a NULL value occupies
space – 2 bytes."
This is a misconception -- that's 2 bytes per row, and I'm pretty sure that all rows use those 2 bytes regardless of whether there are any nullable columns.
"A NULL value in databases is a system value that takes up one byte of storage"
This is talking about databases in general, not specifically SQL Server. SQL Server does not use 1 byte to store NULL values.
Even though this question is specifically tagged SQL Server 2005, given that it is now 2021, it should be pointed out that it is a "trick question" for any version of SQL Server after 2005.
This is because if either ROW or PAGE compression is used, or if the column is defined as SPARSE, then a NULL value takes "no space" in the actual row. Both features were added in SQL Server 2008.
The implementation notes for ROW COMPRESSION (which is a prerequisite for PAGE COMPRESSION) state:
NULL and 0 values across all data types are optimized and take no bytes [1].
While there is still minimal metadata (4 bits per column + (record overhead / columns)) stored per non-sparse column in each physical record [2], it's strictly not the value and is required in all cases [3].
SPARSE columns with a NULL value take up no space and no relevant per-row metadata (as the number of SPARSE columns increase), albeit with a trade-off for non-NULL values.
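A hedged sketch (hypothetical table and column names) of the two options mentioned above:

    -- Option 1: row compression; NULL and 0 values then take no bytes in the row
    ALTER TABLE dbo.BigTable REBUILD WITH (DATA_COMPRESSION = ROW);

    -- Option 2: mark a mostly-NULL column as SPARSE; NULLs take no space,
    -- at the cost of extra overhead whenever a non-NULL value is stored
    ALTER TABLE dbo.BigTable ALTER COLUMN RarelyUsed ADD SPARSE;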
As such, it is hard to "count" space without analyzing the actual DB usage stats. The average bytes per row will vary based on precise column types, table/index rebuild settings, actual data and duplicity, fill capacity, effective page utilization, fragmentation, LOB usage, etc., and is often a more useful metric.
[1] SQLite uses a similar approach, making NULL values effectively free.
[2] A brief description of the technical layout used in ROW (and thus PAGE) compression can be found in "SQL Server 2012 Internals: Special Storage":
Following the 1 or 2 bytes for the number of columns is the CD array, which uses 4 bits [of metadata] for each column in the table to represent information about the length of the column .. 0 (0x0) indicates that the corresponding column is NULL.
[3] Fun fact: with ROW compression, bit column values exist entirely in the corresponding 4-bit metadata.
I've been trawling Books Online and Google incantations trying to find out what fill factor physically is in a leaf page (SQL Server 2000 and 2005).
I understand that it's the amount of room left free on a page when an index is created, but what I've not found is how that space is actually left: i.e., is it one big chunk towards the end of the page, or several gaps spread through the data?
For example, [just to keep things simple], assume a page can only hold 100 rows. If the fill factor is stated to be 75%, does this mean that the first (or last) 75% of the page is data and the rest is free, or is every fourth row free (i.e., the page looks like: data, data, data, free, data, data, data, free, ...)?
The long and short of this is that I'm trying to get a handle on exactly what physical operations occur when inserting a row into a table with a clustered index, when the insert isn't happening at the end of the table. If multiple gaps are left throughout a page, then an insert has minimal impact (at least until a page split), as the number of rows that may need to be moved to accommodate the insert is minimised. If the gap is one big chunk in the page, then the overhead of juggling the rows around would (in theory at least) be significantly greater.
If someone knows an MSDN reference, point me to it please! I can't find one at the moment (still looking though). From what I've read it's implied that it's many gaps - but this doesn't seem to be explicitly stated.
From MSDN:
The fill-factor setting applies only when the index is created, or rebuilt. The SQL Server Database Engine does not dynamically keep the specified percentage of empty space in the pages. Trying to maintain the extra space on the data pages would defeat the purpose of fill factor because the Database Engine would have to perform page splits to maintain the percentage of free space specified by the fill factor on each page as data is entered.
and, further:
When a new row is added to a full index page, the Database Engine moves approximately half the rows to a new page to make room for the new row. This reorganization is known as a page split. A page split makes room for new records, but can take time to perform and is a resource intensive operation. Also, it can cause fragmentation that causes increased I/O operations. When frequent page splits occur, the index can be rebuilt by using a new or existing fill factor value to redistribute the data.
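For what it's worth, the fill factor is supplied when the index is created or rebuilt, along these lines (hypothetical index and table names):

    ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 75);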
SQL Server's data page consists of the following elements:
Page header: 96 bytes, fixed.
Data: variable
Row offset array: variable.
The row offset array is always stored at the end of the page and grows backwards.
Each element of the array is the 2-byte value holding the offset to the beginning of each row within the page.
Rows are not ordered within the data page: instead, their order (in case of clustered storage) is determined by the row offset array. It's the row offsets that are sorted.
Say we insert a 100-byte row with a cluster key value of 10 into a clustered table, and it goes into a free page. It gets inserted as follows:
[00 - 95 ] Header
[96 - 195 ] Row 10
[196 - 8189 ] Free space
[8190 - 8191 ] Row offset array: [96]
Then we insert a new row into the same page, this time with the cluster key value of 9:
[00 - 95 ] Header
[96 - 195 ] Row 10
[196 - 295 ] Row 9
[296 - 8187 ] Free space
[8188 - 8191 ] Row offset array: [196] [96]
The row is prepended logically but appended physically.
The offset array is reordered to reflect the logical order of the rows.
Given this, we can easily see that rows are appended to the free space starting from the beginning of the page, while pointers to the rows are prepended to the free space starting from the end of the page.
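If you want to see this layout on a real page, you can dump one with DBCC PAGE (undocumented but widely used; the database name, file id and page id below are placeholders):

    DBCC TRACEON(3604);                 -- route DBCC PAGE output to the client
    DBCC PAGE('MyDatabase', 1, 153, 3); -- database, file id, page id, print option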
This is the first time I've thought of this, and I'm not positive about the conclusion, but,
Since the smallest amount of data that can be retrieved by SQL Server in a single read IO is one complete page, why would the rows within a single page need to be physically sorted in the first place? I'd bet that they're not, so even if the free space is all in one big gap at the end, new records can be added at the end regardless of whether that's the right sort order (if there's no reason to sort records on a page in the first place).
And, secondly, thinking about the write side of the IO, I think the smallest write chunk is an entire page as well (even the smallest change requires the entire page to be written back to disk). This means that all the rows on a page could get sorted in memory every time the page is written, so even if you were inserting into the beginning of a sorted set of rows on a single page, the whole page gets read out, the new record is inserted into its proper slot in memory, and then the whole new sorted page gets written back to disk...