From what I can find on Google, SQLite doesn't support clustered indexes (see Four: Clustered Indexes), but what I don't understand is:
This means that if your index is a sequential INTEGER, the records are physically laid out in the database in that INTEGER's order, 1 then 2 then 3.
If some records are deleted from a table that has a sequential int index, where will newly inserted records be placed? From what I know, the record's int ID will only grow, so the records will be appended to the tail, right? Does that mean the deleted places are wasted?
In the case of no sequential INTEGER index, is the sqlite table a heap table? I.e., will the record be placed wherever free space is first found?
1) Right, records will be appended at the tail. Yes, the deleted places may be wasted if the database engine cannot re-use them easily. The unused space is reclaimed when you compact the database with the VACUUM command (see the sketch after point 2).
2) Yes, such a SQL table is a heap table. But indexes (of any kind) are precisely made to access data as if the records were sorted: indexes are sorted values linked to records. New records, however, are not necessarily placed where free space is first found. They are placed wherever the database engine finds it best to place them, taking into account unused space and the time needed to write the binary data (appending at the tail is faster than inserting in the middle).
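For point 1, here is a minimal SQLite sketch (the table and column names are invented for illustration): pages freed by DELETE go onto an internal freelist and are re-used for later inserts, but the file itself only shrinks after VACUUM.

    -- Hypothetical table; SQLite keeps pages freed by DELETE on a freelist.
    CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT);
    -- ... insert many rows, then delete a range of them ...
    DELETE FROM t WHERE id BETWEEN 1000 AND 2000;

    PRAGMA freelist_count;   -- unused pages still held inside the file

    VACUUM;                  -- rebuild the file and release the unused pages

    PRAGMA freelist_count;   -- now 0 (or close to it)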
Related
I have a CRM system where I am reviewing the structure of the SQL Server 2017 database.
There is a History table with a clustered index on Date_Entered (datetime) and Time_Entered (varchar(8)). There are around 30 million rows (and counting).
When a customer record is opened, their most recent history is displayed in reverse date / time order.
With the index in ascending order, new rows are added at the end of the table. If I were to rebuild it in descending date / time order, I assume that displaying the records would be quicker (though it isn't slow anyway), but would that mean that new records would be inserted at the beginning, and therefore lead to constant page splits, or is that not the case?
You won't end up with constant page splits. You probably think that because you have the wrong mental image of clustered indexes and think that they are stored sorted in physical order so new rows must somehow get inserted physically at the beginning. This is not the case.
But - don't do that as you will end up with 99.x% logical fragmentation without page splits. The new physical pages will tend to be allocated from later in the file than previous ones whereas the logical order of the index would need them to be allocated from earlier (example).
Assuming inserts are in order of Date_Entered, Time_Entered then each new page will be filled up and then a new page allocated and filled up. There is no need for page splits that move rows between pages - just allocation of new pages. The only issue is that the physical page order will be likely a complete reversal of the logical index order until you rebuild or reorganize the index.
I assume that displaying the records would be quicker
No, that is not needed anyway, as the leaf pages of the index are linked together in a doubly linked list and can be read in either the forward or the backward direction.
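A sketch, reusing the column names from the question and assuming the table is called dbo.History: a query like the one below can be satisfied by a backward range scan of the existing ascending clustered index, with no Sort operator in the plan.

    -- Returns the most recent history first by scanning the ascending
    -- (Date_Entered, Time_Entered) clustered index backwards.
    SELECT TOP (50) *
    FROM dbo.History
    ORDER BY Date_Entered DESC, Time_Entered DESC;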
Backward scans can't be parallelised but hopefully this isn't a concern for you.
In any event, for displaying the customer record you are probably using a different index with leading column customerid, making this largely irrelevant.
I'm new to databases. I recently started using timescaledb, which is an extension of PostgreSQL, so I guess this is also PostgreSQL related.
I observed some strange behavior. I worked out my table structure: 1 timestamp and 2 doubles, so 24 bytes per row in total. I imported (via psycopg2 copy_from) 2,750,182 rows from a csv file. By hand I calculated the size should be about 63 MB, but when I query timescaledb it tells me the table size is 137 MB, the index size is 100 MB, and the total is 237 MB. I was expecting the table size to match my calculation, but it doesn't. Any idea?
There are two basic reasons your table is bigger than you expect:
1. Per tuple overhead in Postgres
2. Index size
Per tuple overhead: An answer to a related question goes into detail that I won't repeat here, but basically Postgres uses 23 (+ padding) bytes per row for various internal things, mostly multi-version concurrency control (MVCC) management (Bruce Momjian has some good intros if you want more info). That gets you pretty darn close to the 137 MB you are seeing. The rest might be due to either the fill factor setting of the table or any dead rows still included in the table from, say, a previous insert and subsequent delete.
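As a rough back-of-the-envelope for the numbers in the question (assuming 8 kB pages, the 23-byte header padded to 24 bytes, and a 4-byte line pointer per row): 24 bytes of data + 24 bytes of header + 4 bytes of line pointer ≈ 52 bytes per row, so about 157 rows fit on a page and 2,750,182 rows need roughly 17,500 pages, i.e. about 137 MB. You can check the actual on-disk numbers yourself ('mytable' is a placeholder; with timescaledb the rows actually live in the chunk tables, so size those too):

    -- PostgreSQL built-in size functions.
    SELECT pg_size_pretty(pg_relation_size('mytable'))       AS heap_size,
           pg_size_pretty(pg_indexes_size('mytable'))        AS index_size,
           pg_size_pretty(pg_total_relation_size('mytable')) AS total_size;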
Index Size: Unlike some other DBMSs Postgres does not organize its tables on disk around an index, unless you manually cluster the table on an index, and even then it will not maintain the clustering over time (see https://www.postgresql.org/docs/10/static/sql-cluster.html). Rather it keeps its indices separately, which is why there is extra space for your index. If on-disk size is really important to you and you aren't using your index for, say, uniqueness constraint enforcement, you might consider a BRIN index, especially if your data is going in with some ordering (see https://www.postgresql.org/docs/10/static/brin-intro.html).
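A minimal sketch of the BRIN suggestion (the table readings and column ts are invented names; BRIN works best when rows arrive roughly in ts order):

    -- A BRIN index stores one small summary entry per range of heap pages,
    -- so it is typically orders of magnitude smaller than a B-tree on ts.
    CREATE INDEX readings_ts_brin ON readings USING brin (ts);

    -- For comparison, the equivalent B-tree:
    CREATE INDEX readings_ts_btree ON readings USING btree (ts);
    SELECT pg_size_pretty(pg_relation_size('readings_ts_brin'))  AS brin_size,
           pg_size_pretty(pg_relation_size('readings_ts_btree')) AS btree_size;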
I have another question, but I'll be more specific.
I see that selecting from a million-row table takes < 1 second. What I don't understand is how it can do this with indexes. It seems to take 10 ms to do a seek, so to finish in under 1 second it must do fewer than 100 seeks. If there is an index entry per row, then 1M rows means at least 1K blocks to store the index (actually more if it's 8 bytes per entry: a 32-bit index value + a 32-bit key offset). Then we would need to actually travel to the rows and collect the data. How do databases keep the seeks low and pull the data as fast as they do?
One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read in a range of values along the indexed field, you find the first one, and you can just read it all in at once with no extra IO.
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records that the SQL server statistically expects to retrieve is so high that the random access required to go from the index to the underlying rows would need so many IO operations that it would be faster to do a table scan, then it will do a table scan instead. You can see this, e.g., using the query planner in SQL Server or PostgreSQL. But for small ranges the index is usually better, and the query plan will reflect this.
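As a quick illustration of point 2 (PostgreSQL syntax; the table events and its indexed id column are invented), the planner switches from an index scan to a sequential scan as the requested range grows:

    -- Narrow range: expect an Index Scan (or Bitmap Index Scan).
    EXPLAIN SELECT * FROM events WHERE id BETWEEN 100 AND 200;

    -- Very wide range: a Seq Scan is usually cheaper and will be chosen.
    EXPLAIN SELECT * FROM events WHERE id BETWEEN 100 AND 900000;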
Can anyone explain why databases tend to use b-tree indexes rather than a linked list of ordered elements?
My thinking is this: On a B+ Tree (used by most databases), the non-leaf nodes are a collection of pointers to other nodes. Each collection (node) is an ordered list. The leaf nodes, which are where all the data pointers are, form a linked list of clusters of data pointers.
The non-leaf nodes are just used to find the correct leaf node in which your target data pointer lives. So, as the leaf nodes are just like a linked list, why not do away with the tree elements and just have the linked list? Metadata can be provided that gives the minimum and maximum value of each leaf-node cluster, so the application can just read the metadata and find the correct leaf where the data pointer lives.
Just to be clear, the most efficient algorithm for searching a randomly accessed ordered list is a binary search, which has a performance of O(log n), the same as a b-tree. The benefit of using a linked list rather than a tree is that it doesn't need to be balanced.
Is this structure feasible?
After some research and paper reading I found the answer.
In order to cope with large amounts of data, such as millions of records, indexes have to be organised into clusters. A cluster is a contiguous group of sectors on a disk that can be read into memory quickly. These are usually about 4096 bytes long.
Each one of these clusters can contain a bunch of indexes which can point to other clusters or data on a disk. So if we had a linked list index, each element of the index would be made up of the collection of indexes contained in a single cluster (say 100).
So, when we are looking for a specific record, how do we know which cluster it is on? We perform a binary search to find the cluster in question [O(log n)].
However, to do a binary search we need to know the range of values in each cluster, so we need meta-data that records the min and max value of each cluster and where that cluster is. This is great. Except, if each cluster can contain 100 indexes, and our meta-data is also held on a single cluster (for speed), then our meta-data can only point to 100 clusters.
What happens if we want more than 100 clusters? We have to add a second meta-data cluster, also pointing to 100 data clusters (another 10,000 records). That's still not enough, so we keep adding meta-data clusters; with 100 of them we can reach 1,000,000 records. But now, how do we know which of these meta-data clusters to query in order to find our target data cluster? We could search them one after another, but that doesn't scale. So I add another meta-meta-data cluster to indicate which meta-data cluster I should query to find the target data cluster. Now I have a tree!
So that's why databases use trees. It's not the speed; it's the size of the indexes and the need to have indexes referencing other indexes. What I have described above is a B+Tree: child nodes contain references to other child nodes or leaf nodes, and leaf nodes contain references to data on disk.
Phew!
I guess I answered that question in Chapter 1 of my SQL Indexing Tutorial: http://use-the-index-luke.com/sql/anatomy
To summarize the most important parts, with respect to your particular question:
-- from "The Leaf Nodes"
The primary purpose of an index is to provide an ordered representation of the indexed data. It is, however, not possible to store the data sequentially because an insert statement would need to move the following entries to make room for the new one. But moving large amounts of data is very time-consuming, so that the insert statement would be very slow. The problem's solution is to establish a logical order that is independent of physical order in memory.
-- from "The B-Tree":
The index leaf nodes are stored in an arbitrary order—the position on the disk does not correspond to the logical position according to the index order. It is like a telephone directory with shuffled pages. If you search for “Smith” but open it at “Robinson” in the first place, it is by no means granted that Smith comes farther back. Databases need a second structure to quickly find the entry among the shuffled pages: a balanced search tree—in short: B-Tree.
Linked lists are usually not ordered by key value, but by the moment of insertion: insertion is done at the end of the list, and each new entry contains a pointer to the previous entry of the list.
They are usually implemented as heap structures.
This has 2 main benefits:
they are very easy to manage (you just need a pointer for each element)
if used in combination with an index you can overcome the problem of sequential access.
If instead you use a list ordered by key value, you will have ease of access (binary search), but encounter problems each time you edit, delete, or insert an element: you must in fact keep your list ordered after performing the operation, making the algorithms more complex and time-consuming.
B+ trees are better structures, having all the properties you stated, and other advantages:
you can make group searches (by intervals of key values) at roughly the same cost as a single search, since elements in the leaves end up automatically ordered thanks to the insertion algorithm; this is not possible with linked lists because it would require many linear searches over the list (see the sketch below).
cost is logarithmic in the number of elements contained, and since these structures are kept balanced the cost of access does not depend on the particular value you are looking for (very useful).
these structures are very efficient for update, insert and delete operations.
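As a sketch of the first point (generic SQL; the table t and indexed column k are invented): a range predicate is answered with one descent of the tree followed by a walk along the ordered, linked leaf pages, and the rows come back already sorted.

    -- One descent to find k = 100, then a sequential walk of the leaf
    -- level up to k = 200; no extra sort is needed for ORDER BY k.
    SELECT *
    FROM t
    WHERE k BETWEEN 100 AND 200
    ORDER BY k;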
I have a table myTable with a unique clustered index on myId with a fill factor of 100%.
It's an integer, starting at zero (but it's not an identity column for the table).
I need to add a new type of row to the table.
It might be nice if I could distinguish these rows by using negative values of myId.
Would having negative values incur extra page splitting and slow down inserts?
Extra Background:
This table exists as part of the etl for a data warehouse that gathers data from disparate systems. I now want to accommodate a new type of data. A way for me to do this is to reserve negative ids for this new data, which will thus be automatically clustered. This will also avoid major key changes or extra columns in the schema.
Answer Summary:
A fill factor of 100% will normally slow down inserts, but not inserts that happen sequentially, and that includes the sequential negative inserts.
Besides the practical administration points you already got and the dubious use of negative ids to represent data model attributes, there is also a valid question here: given a table with int ids from 0 to N, where would newly inserted negative values go, and would they cause additional splits?
The initial rows will be placed on the clustered index leaf pages, the row with id 0 on the first page and the row with id N on the last page, filling the pages in between. When the first row with value -1 is inserted, it will sort ahead of the row with id 0 and as such will add a new page to the tree (it will actually allocate an extent of 8 pages, but that is a different point) and will link that page in at the front of the leaf-level linked list of pages. This will NOT cause a page split of the former first page. Further inserts of values -2, -3, etc. will go to the same new page and will be inserted in the proper position (-2 ahead of -1, -3 ahead of -2, etc.) until the page fills. Inserts after that will add a new page ahead of this one, which will accommodate further new values. Inserts of positive values N+1, N+2 will go to the last page and be placed in it until it fills; then they'll cause a new page to be added and will start filling that page.
So basically the answer is this: inserts at either end of a clustered index should not cause page splits. Page splits can be caused only by inserts between two existing keys. This actually extends to the non-leaf pages as well: an insert at either end of the cluster should not split a non-leaf page either. I am not discussing the impact of updates here, of course (they can cause splits if they increase the length of a variable-length column).
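If you want to verify this yourself, here is a hedged T-SQL sketch (the object names are placeholders and it assumes the remaining columns are nullable or have defaults): insert a small batch of negative ids and compare the clustered index's page count and fragmentation before and after; you should see new pages appear rather than existing pages splitting.

    -- Insert at the 'negative' end of the clustered key.
    INSERT INTO dbo.myTable (myId) VALUES (-1), (-2), (-3);

    -- Physical stats for the clustered index (index_id = 1).
    SELECT index_level, page_count, avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.myTable'), 1, NULL, 'DETAILED');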
Lately there has been a lot of talk in the SQL Server blogosphere about the potential performance problems of page splits, but I must warn against going to unnecessary extremes to avoid them. Page splits are a normal index operation. If you find yourself in an environment where the page-split performance hit is visible during inserts, then you'll probably be hurt worse by the 'mitigation' measures, because you'll create artificial page latch hot spots that are far worse as they'll affect every insert. What is true is that prolonged operation with frequent splits will result in high fragmentation, which impacts data access time. I say that is best mitigated with an off-peak periodic index maintenance operation (reorganize). Avoid premature optimizations; always measure first.
Not enough to notice for any reasonable system.
Page splits happen when a page is full, either at the start or at the end of the range.
As long as you do regular index maintenance...
Edit, after Fill factor comments:
After a page split with 90 or 100 FF, each page will be 50% full. FF = 100 only means a split will happen sooner (probably on the 1st insert).
With a strictly monotonically increasing (or decreasing) key (+ve or -ve), a page split happens at either end of the range.
However, from BOL, FILLFACTOR
Adding Data to the End of the Table
A nonzero fill factor other than 0 or 100 can be good for performance if the new data is evenly distributed throughout the table. However, if all the data is added to the end of the table, the empty space in the index pages will not be filled. For example, if the index key column is an IDENTITY column, the key for new rows is always increasing and the index rows are logically added to the end of the index. If existing rows will be updated with data that lengthens the size of the rows, use a fill factor of less than 100. The extra bytes on each page will help to minimize page splits caused by extra length in the rows.
So does fill factor matter for strictly monotonic keys...? Especially if it's low-volume writes?
No, not at all. Negative values are just as valid INTegers as positive ones. No problem. Basically, internally, they're all just 4 bytes' worth of zeroes and ones :-)
Marc
You are asking the wrong question!
If you create a clustered index that has a fillfactor of 100%, every time a record is inserted, deleted or even modified, page splits can occur because there is likely no room on the existing index data page to write the change.
Even with regular index maintenance, a fill factor of 100% is counterproductive on a table where you know inserts are going to be performed. A more usual value would be 90%.
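If you do decide to leave head-room in the pages, here is a minimal T-SQL sketch (the index and table names are placeholders):

    -- Rebuild the clustered index leaving 10% free space on each leaf page.
    ALTER INDEX PK_myTable ON dbo.myTable
    REBUILD WITH (FILLFACTOR = 90);

Note that, as the BOL excerpt quoted above points out, this free space only pays off when inserts or row-lengthening updates land in the middle of the key range; for purely sequential keys the gaps are never filled.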
I'm concerned that this post may have taken a wrong turn, in that there seems to be an underlying design issue at work here, irrespective of the resultant page splits.
Why do you need to introduce a negative ID?
An integer primary key, for example, should uniquely identify a row; its sign should be irrelevant. I suspect that there may be a definition issue with the primary key for your table if this is not the case.
If you need to flag/identify the newly inserted records then create a column specifically for this purpose.
This solution would be ideal because you may then be able to ensure that your primary key is sequential (perhaps using an Identity data type, although not essential), thereby avoiding issues with page splits (on insert) altogether.
Also, to confirm if I may: a fill factor of 100% for a clustered index primary key (an identity integer, for example) will not cause page splits for sequential inserts!