I have a CRM system where I am reviewing the structure of the SQL Server 2017 database.
There is a History table with a clustered index on Date_Entered (datetime) and Time_Entered (varchar(8)). There are around 30 million rows (and counting).
When a customer record is opened, their most recent history is displayed in reverse date / time order.
With the index in ascending order, new rows are added at the end of the table. If I were to rebuild it in descending date / time order, I assume that displaying the records would be quicker (though it isn't slow anyway), but would that mean that new records would be inserted at the beginning, and therefore lead to constant page splits, or is that not the case?
You won't end up with constant page splits. That worry usually comes from the wrong mental image of clustered indexes: that the rows are stored in strict physical sort order, so a new row would somehow have to be inserted physically at the beginning. This is not the case.
But - don't do it anyway, as you will end up with 99.x% logical fragmentation without any page splits. The new physical pages will tend to be allocated from later in the file than the previous ones, whereas the logical order of the index would need them to be allocated from earlier.
Assuming inserts are in order of Date_Entered, Time_Entered, then each new page will be filled up and then a new page allocated and filled up. There is no need for page splits that move rows between pages - just allocation of new pages. The only issue is that the physical page order will likely be a complete reversal of the logical index order until you rebuild or reorganize the index.
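For reference, the descending rebuild and a follow-up fragmentation check would look something like this (the index name CIX_History is an assumption; the table and columns are from the question):

    -- Rebuild the clustered index in descending order
    CREATE CLUSTERED INDEX CIX_History
        ON dbo.History (Date_Entered DESC, Time_Entered DESC)
        WITH (DROP_EXISTING = ON);

    -- Check the logical fragmentation that accumulates afterwards
    SELECT avg_fragmentation_in_percent, page_count
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID(N'dbo.History'), NULL, NULL, 'LIMITED');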
I assume that displaying the records would be quicker
No, and it isn't needed anyway: the leaf pages of the index are linked together in a doubly linked list and can be read in either the forward or the backward direction.
Backward scans can't be parallelised but hopefully this isn't a concern for you.
In any event, for displaying a customer's records you are probably using a different index with CustomerId as the leading column, which makes this largely irrelevant.
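For illustration, a query like the sketch below (the CustomerId column and an index leading on it are assumptions) can return the rows in descending order via a backward scan of an ascending index:

    DECLARE @CustomerId int = 42;  -- illustrative value

    SELECT TOP (50) *
    FROM dbo.History
    WHERE CustomerId = @CustomerId
    ORDER BY Date_Entered DESC, Time_Entered DESC;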
Related
I used to think that when I update an indexed column in a table, the index is updated at the same time. But in one of my interviews the interviewer insisted that it doesn't work that way: for any update in the base table, the index is rebuilt/reorganized. I am pretty sure this can't be right, as both operations are very costly, but I still want to check with an expert.
While thinking about this, one more thing came to mind. Say I have index column values 1-1000. As per the B-tree structure, a value like 999 will sit in the right-most nodes from top to bottom. Now if I update this column from 999 to 2, a lot of shuffling would seem to be required to place this value correctly in the index B-tree. How is that handled if no index rebuild/reorganize happens after the base table update?
I used to think that when I update an indexed column in a table, the index is updated at the same time.
Yes, that's true, and the same goes for deletes and inserts.
Other indexing systems may work differently and need to be updated incrementally, or rebuilt in their entirety, separately from the indexed data. That may be the source of the confusion.
Statistics need to be updated separately. (See other active discussions in this group.)
For any update in the base table, the index is rebuilt/reorganized.
No. But if SQL Server cannot fit the row in its physical place, a page split may occur; or when a key value changes, a single physical row movement may occur.
Both may cause fragmentation. Too much fragmentation may cause performance issues. That's why DBAs find it necessary to reduce fragmentation by rebuilding or reorganizing an index at a convenient time.
Say I have index column values 1-1000. As per the B-tree structure, a value like 999 will sit in the right-most nodes from top to bottom. Now if I update this column from 999 to 2, a lot of shuffling would seem to be required to place this value correctly in the index B-tree. How is that handled if no index rebuild/reorganize happens after the base table update?
Only the changed row is moved to another slot in another page of the B-tree; the original slot simply becomes empty. If the new page is full, a page split occurs. That causes a change in the parent page, which may cause another page split if that page is also full, and so on. All of these events may cause fragmentation, which in turn may cause performance degradation.
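To make the mechanics concrete, here is a small repro sketch (all object names are hypothetical):

    -- Clustered B-tree keyed on id
    CREATE TABLE dbo.KeyMoveDemo
    (
        id      int NOT NULL PRIMARY KEY CLUSTERED,
        payload char(200) NOT NULL DEFAULT ('x')
    );

    -- Fill with keys 1..1000
    INSERT dbo.KeyMoveDemo (id)
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects;

    -- Move one key from the right edge of the tree to the left edge.
    -- Only this one row moves; no rebuild or reorganize takes place.
    -- (0 is used instead of 2 because 2 is already taken in this demo.)
    UPDATE dbo.KeyMoveDemo SET id = 0 WHERE id = 999;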
I have another question, but I'll be more specific.
I see that selecting from a million-row table takes < 1 second. What I don't understand is how it can do this with indexes. A seek seems to take about 10 ms, so to finish in 1 second it must do fewer than 100 seeks. If there is an index entry per row, then 1M rows needs at least 1K blocks just to store the index (more, actually, at 8 bytes per row: a 32-bit index value plus a 32-bit key offset). Then we would still need to travel to the rows themselves and collect the data. How do databases keep the seeks so low and pull the data as fast as they do?
One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read in a range of values along the indexed field, you find the first one, and you can just read it all in at once with no extra IO.
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records that the SQL server statistically expects to retrieve is so high that the random access required to go from the index to the underlying rows would take more IO operations than a table scan, then it will do a table scan instead. You can see this using the query planner, e.g. in SQL Server or PostgreSQL. But for small ranges the index is usually better, and the query plan will reflect this, as the sketch below shows.
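A quick way to watch this decision, assuming a hypothetical dbo.Orders table with a nonclustered index on OrderDate:

    SET STATISTICS IO ON;

    -- Narrow range: the optimizer will likely seek the index
    SELECT * FROM dbo.Orders WHERE OrderDate = '20240101';

    -- Wide range: the random lookups would cost more than reading everything,
    -- so the optimizer will likely choose a table scan instead
    SELECT * FROM dbo.Orders WHERE OrderDate >= '20000101';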
From what I can find via Google, SQLite doesn't support clustered indexes (see Four: Clustered Indexes), but there are two things I don't understand:
This means that if your index is a sequential INTEGER, the records are physically laid out in the database in that INTEGER's order, 1 then 2 then 3.
If some records are deleted from a table with a sequential int index, where will newly inserted records be placed? From what I know, the int ID only grows, so new records will be appended at the tail, right? Does that mean the space freed by the deletes is wasted?
In the absence of a sequential INTEGER index, is an SQLite table a heap table, i.e. is each record placed in the first free space found?
1) Right, records will be appended at the tail. And yes, the freed space may be wasted if the database engine cannot easily reuse it. The unused space is reclaimed when you compact the database with the VACUUM command.
2) Yes, such a table is a heap table. But indexes (of any kind) are precisely what lets you access the data as if the records were sorted: an index is a set of sorted values linked to records. New records are not necessarily placed in the first free space found; they are placed wherever the database engine sees fit, weighing unused space against the time to write the data (appending at the tail is faster than inserting in the middle).
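A small sketch in SQLite showing both points (table and values are illustrative):

    CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT);
    INSERT INTO t (v) VALUES ('a'), ('b'), ('c');  -- gets ids 1, 2, 3
    DELETE FROM t WHERE id = 2;                    -- leaves a hole
    INSERT INTO t (v) VALUES ('d');                -- gets id 4, appended at the tail
    VACUUM;                                        -- compacts the space freed by the delete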
I have a table myTable with a unique clustered index myId with a fill factor of 100%.
It's an integer, starting at zero (but it's not an identity column for the table).
I need to add a new type of row to the table.
It might be nice if I could distinguish these rows by using negative values of myId.
Would having negative values incur extra page splitting and slow down inserts?
Extra Background:
This table exists as part of the ETL for a data warehouse that gathers data from disparate systems. I now want to accommodate a new type of data. One way to do this is to reserve negative ids for the new data, which will thus be automatically clustered together. This also avoids major key changes or extra columns in the schema.
Answer Summary:
A fill factor of 100% will normally slow down inserts, but not inserts that happen sequentially, and that includes the sequential negative inserts.
Besides the practical administration points you already got, and the dubious use of negative ids to represent data model attributes, there is also a valid question here: given a table with int ids from 0 to N, where would newly inserted negative values go, and would they cause additional splits?
The initial rows will be placed on the clustered index leaf pages, the row with id 0 on the first page and the row with id N on the last page, filling the pages in between. When the first row with the value -1 is inserted, it will sort ahead of the row with id 0 and as such will add a new page to the tree (it will actually allocate an extent of 8 pages, but that is a different point) and link that page in at the front of the leaf-level linked list of pages. This will NOT cause a page split of the former first page. Further inserts of values -2, -3 etc. will go to the same new page and be inserted in the proper position (-2 ahead of -1, -3 ahead of -2, etc.) until the page fills; further inserts will then add a new page ahead of this one to accommodate further new values. Inserts of positive values N+1, N+2 will go on the last page and be placed in it until it fills; then they'll cause a new page to be added and start filling that page.
So basically the answer is this: inserts at either end of a clustered index should not cause page splits. Page splits are caused only by inserts between two existing keys. This extends to the non-leaf pages as well; an insert at either end of the clustered index should not split a non-leaf page either. I am not discussing the impact of updates here, of course (they can cause splits if they increase the length of a variable-length column).
Lately there has been a lot of talk in the SQL Server blogosphere about the potential performance problems of page splits, but I must warn against going to unnecessary extremes to avoid them. Page splits are a normal index operation. If you find yourself in an environment where the page split performance hit is visible during inserts, then you'll probably be hit worse by the 'mitigation' measures, because you'll create artificial page latch hot spots that are far worse, as they affect every insert. What is true is that prolonged operation with frequent splits results in high fragmentation, which impacts data access time. That is best mitigated with periodic off-peak index maintenance (reorganize). Avoid premature optimizations; always measure first.
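For example, the off-peak maintenance step could be as simple as (the index name is an assumption):

    -- Reorganize off-peak to remove the fragmentation accumulated by splits
    ALTER INDEX CIX_myTable ON dbo.myTable REORGANIZE;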
Not enough to notice for any reasonable system.
Page splits happen when a page is full, either at the start or at the end of the range.
As long as you do regular index maintenance...
Edit, after Fill factor comments:
After a page split with 90 or 100 FF, each page will be 50% full. FF = 100 only means a split will happen sooner (probably on the first insert).
With a strictly monotonically increasing (or decreasing) key (+ve or -ve), a page split happens at either end of the range.
However, from BOL, FILLFACTOR
Adding Data to the End of the Table
A fill factor other than 0 or 100 can be good for performance if the new data is evenly distributed throughout the table. However, if all the data is added to the end of the table, the empty space in the index pages will not be filled. For example, if the index key column is an IDENTITY column, the key for new rows is always increasing and the index rows are logically added to the end of the index. If existing rows will be updated with data that lengthens the size of the rows, use a fill factor of less than 100. The extra bytes on each page will help to minimize page splits caused by extra length in the rows.
So does fill factor matter for strictly monotonic keys...? Especially if it's low-volume writes?
No, not at all. Negative values are just as valid INTegers as positive ones. No problem. Basically, internally, they're all just 4 bytes' worth of zeroes and ones :-)
Marc
You are asking the wrong question!
If you create a clustered index that has a fillfactor of 100%, every time a record is inserted, deleted or even modified, page splits can occur because there is likely no room on the existing index data page to write the change.
Even with regular index maintenance, a fill factor of 100% is counter productive on a table where you know inserts are going to be performed. A more usual value would be 90%.
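For example (table name from the question; applying it to all indexes is illustrative):

    -- Rebuild with headroom for future inserts; 90 is the value suggested above
    ALTER INDEX ALL ON dbo.myTable REBUILD WITH (FILLFACTOR = 90);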
I'm concerned that this post may have taken a wrong turn, in that there seems to be an underlying design issue at work here, irrespective of the resultant page splits.
Why do you need to introduce a negative ID?
An integer primary key, for example, should uniquely identify a row; its sign should be irrelevant. I suspect that there may be a definition issue with the primary key for your table if this is not the case.
If you need to flag/identify the newly inserted records then create a column specifically for this purpose.
This solution would be ideal because you may then be able to ensure that your primary key is sequential (perhaps using an Identity data type, although not essential), thereby avoiding issues with page splits (on insert) altogether.
Also, to confirm if I may: a fill factor of 100% for a clustered index primary key (an identity integer, for example) will not cause page splits for sequential inserts!
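A sketch of that suggested design (the RowType flag column and the other names are hypothetical):

    CREATE TABLE dbo.myTable
    (
        myId    int IDENTITY(0, 1) NOT NULL PRIMARY KEY CLUSTERED,
        RowType tinyint NOT NULL DEFAULT (0),  -- hypothetical flag for the new kind of row
        Payload varchar(100) NULL              -- placeholder for the real columns
    );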
I have a sproc that puts 750K records into a temp table through a query as one of its first actions. If I create indexes on the temp table before filling it, the sproc takes about twice as long to run compared to when I index after filling the table. (The index is on a single integer column; the table being indexed has just two columns, each a single integer.)
This seems a little off to me, but then I don't have the firmest understanding of what goes on under the hood. Does anyone have an answer for this?
If you create a clustered index, it affects the way the data is physically ordered on the disk. It's better to add the index after the fact and let the database engine reorder the rows when it knows how the data is distributed.
For example, let's say you needed to build a brick wall with numbered bricks so that those with the highest number are at the bottom of the wall. It would be a difficult task if you were just handed the bricks in random order, one at a time - you wouldn't know which bricks were going to turn out to be the highest numbered, and you'd have to tear the wall down and rebuild it over and over. It would be a lot easier to handle that task if you had all the bricks lined up in front of you, and could organize your work.
That's how it is for the database engine - if you let it know about the whole job, it can be much more efficient than if you just feed it a row at a time.
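In T-SQL terms, the "whole job" approach looks like this (all names are hypothetical; the SELECT stands in for the 750K-row query):

    CREATE TABLE #work (a int NOT NULL, b int NOT NULL);

    -- Load first, so the engine sees the whole data set...
    INSERT #work (a, b)
    SELECT col1, col2
    FROM dbo.SourceTable;

    -- ...then build the index in a single pass over known data
    CREATE CLUSTERED INDEX IX_work ON #work (a);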
It's because the database server has to do calculations each and every time you insert a new row. Basically, you end up reindexing the table each time. It doesn't seem like a very expensive operation, and it's not, but when you do that many of them together, you start to see the impact. That's why you usually want to index after you've populated your rows, since it will just be a one-time cost.
Think of it this way.
Given
unorderedList = {5, 1, 3}
orderedList = {1, 3, 5}
add 2 to both lists.
unorderedList = {5, 1, 3, 2}
orderedList = {1, 2, 3, 5}
Which list do you think is easier to add to?
Btw, ordering your input before the load will give you a boost.
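For instance (reusing the hypothetical names from the sketch above):

    -- Pre-sort the input to match the target index order
    INSERT #work (a, b)
    SELECT col1, col2
    FROM dbo.SourceTable
    ORDER BY col1;  -- same order as the index on #work(a)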
You should NEVER EVER create an index on an empty table if you are going to massively load it right afterwards.
Indexes have to be maintained as the data in the table changes, so imagine that for every insert on the table the index is being recalculated (which is an expensive operation).
Load the table first and create the index after finishing with the load.
That's where the performance difference is going.
After performing large data manipulation operations, you frequently have to update the statistics on the underlying table. You can do that with the UPDATE STATISTICS [table] statement.
The other option is to drop and recreate the index which, if you are doing large data insertions, will likely perform the inserts much faster. You can even incorporate that into your stored procedure.
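Both options in sketch form (object names are assumptions):

    -- Option 1: refresh statistics after the big insert
    UPDATE STATISTICS dbo.myTable;

    -- Option 2: drop the index, load, then recreate it
    DROP INDEX IX_myTable_col1 ON dbo.myTable;
    -- ... perform the large insert here ...
    CREATE INDEX IX_myTable_col1 ON dbo.myTable (col1);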
This is because if the data you insert is not in index order, SQL Server has to split pages to make room for the additional rows, to keep them together logically.
This is because when SQL Server indexes a table that already contains data, it can produce exact statistics on the values in the indexed column. SQL Server recalculates statistics at certain points, but when you perform massive inserts, the distribution of values may change after the statistics were last calculated.
Out-of-date statistics can be spotted in Query Analyzer: on a given table scan, the expected number of rows differs too much from the actual number of rows processed.
You should use UPDATE STATISTICS to recalculate the distribution of values after you insert all the data. After that, no performance difference should be observed.
If you have an index on a table, as you add data to the table SQL Server will have to re-order the table to make room in the appropriate place for the new records. If you're adding a lot of data, it will have to reorder it over and over again. By creating an index only after the data is loaded, the re-order only needs to happen once.
Of course, if you are importing the records in index order it shouldn't matter so much.
In addition to the index overhead, running each insert as its own transaction is a bad idea for the same reason. If you run chunks of inserts (say 100) within one explicit transaction, you should also see a performance increase.
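A sketch of chunked inserts (names and chunk size are illustrative):

    SET NOCOUNT ON;
    DECLARE @i int = 0;

    WHILE @i < 100000
    BEGIN
        BEGIN TRANSACTION;

        -- Insert the next chunk of 100 rows
        INSERT dbo.Target (id)
        SELECT n
        FROM dbo.Source
        WHERE n > @i AND n <= @i + 100;

        COMMIT TRANSACTION;
        SET @i += 100;
    END;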