According to reputable websites:
ACCESS_METHODS_HOBT_VIRTUAL_ROOT
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-os-latch-stats-transact-sql?view=sql-server-ver15
ACCESS_METHODS_HOBT_VIRTUAL_ROOT: Used to synchronize access to the root page abstraction of an internal B-tree.
Paul Randal
https://www.sqlskills.com/blogs/paul/most-common-latch-classes-and-what-they-mean/
ACCESS_METHODS_HOBT_VIRTUAL_ROOT
This latch is used to access the metadata for an index that contains the page ID of the index's root page. Contention on this latch can occur when a B-tree root page split occurs (requiring the latch in EX mode) and threads wanting to navigate down the B-tree (requiring the latch in SH mode) have to wait. This could be from very fast population of a small index using many concurrent connections, with or without page splits from random key values causing cascading page splits (from leaf to root).
How do I tune SQL Server to limit this wait type?
Context on the problem statement.
Scroll to the bottom for questions.
Note: the tables are not relational; joins can be done at the application level.
Classes
Record
Most atomic unit of the database (each record has key, value, id)
Page
Each page (one file on disk) can store multiple records. A page is a limited-size chunk (8 KB?), and it also stores, at the top, an offset for each id so the record can be located within the page.
Index
A B-tree data structure that supports O(log n) lookups to find which page a given id lives in.
We can also insert (id, page) entries into this B-tree.
Table
Each Table is an abstraction over a directory that stores multiple pages.
Table also stores Index.
Database
Database is an abstraction over a directory which includes all tables that are a part of that database.
Database Manager
Gives ability to switch between different databases, create new databases, and drop existing databases.
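To make the layering above concrete, here is a minimal Python sketch of how these classes might fit together. All names and fields are illustrative, not taken from any existing library:

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8 * 1024  # 8 KB, mirroring the guess above


@dataclass
class Record:
    """Most atomic unit: key, value, id."""
    id: int
    key: str
    value: str


@dataclass
class Page:
    """One fixed-size file; holds records plus an offset map at the top."""
    path: str
    offsets: dict[int, int] = field(default_factory=dict)  # record id -> byte offset in the file


@dataclass
class Index:
    """Stand-in for the B-tree: maps record id -> page path.
    A real implementation would keep keys ordered for O(log n) lookups."""
    entries: dict[int, str] = field(default_factory=dict)


@dataclass
class Table:
    """A directory of page files plus the index over those pages."""
    directory: str
    index: Index = field(default_factory=Index)


@dataclass
class Database:
    """A directory containing the tables that belong to this database."""
    directory: str
    tables: dict[str, Table] = field(default_factory=dict)


class DatabaseManager:
    """Creates, drops, and switches between databases."""

    def __init__(self) -> None:
        self.databases: dict[str, Database] = {}
        self.current: Database | None = None
```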
Communication In Main Process
Starts the Database Manager as its own process.
When the process quits, it saves the indexes back to disk.
The process also flushes the indexes to disk on an interval.
To interact with this DB process, we will use HTTP.
Database Manager stores a reference to the current database being used.
The current-database attribute in the Database Manager holds references to all Tables in a hashmap.
Each Table stores a reference to the index that is read from the index page from disk and kept in memory.
Each Table exposes public methods to set and get key value pair.
The Get method navigates the B-tree to find the right page; on that page it locates the key-value pair using the offset stored at the top of the page, and returns a Record.
The Set method adds a key-value pair to the database and then updates the index for that table.
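Here is a toy, fully in-memory Python sketch of that get/set flow. Real pages would live in files and the index would be a B-tree; this only illustrates the id -> page -> offset lookup described above:

```python
class Table:
    """Toy in-memory model: an index maps id -> page number, and each page
    keeps an offset map from id -> slot within that page."""

    PAGE_CAPACITY = 4  # tiny, so page boundaries show up quickly

    def __init__(self):
        self.pages = [[]]       # each page is a list of (id, key, value) tuples
        self.offsets = [{}]     # per page: id -> slot ("offset stored at the top")
        self.index = {}         # stand-in for the B-tree: id -> page number

    def set(self, key, value):
        record_id = hash(key)
        if len(self.pages[-1]) >= self.PAGE_CAPACITY:
            self.pages.append([])         # current page is full: start a new one
            self.offsets.append({})
        page_no = len(self.pages) - 1
        self.offsets[page_no][record_id] = len(self.pages[page_no])
        self.pages[page_no].append((record_id, key, value))
        self.index[record_id] = page_no   # update the index after the write

    def get(self, key):
        record_id = hash(key)
        page_no = self.index[record_id]          # B-tree descent in a real engine
        slot = self.offsets[page_no][record_id]  # offset kept with the page
        return self.pages[page_no][slot]         # the Record


t = Table()
t.set("user:1", "alice")
print(t.get("user:1"))   # (<id>, 'user:1', 'alice')
```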
Outstanding Questions:
Am I making any logical errors in my design above?
How should I go about figuring out what the data page size should be (not sure why relational DBs use 8 KB)?
How should I store the Index B-tree to disk?
Should the Database load all indexes for the table in memory at the very start ?
A couple of notes from the top of my head:
How many records do you anticipate storing? What are the maximum key and value sizes? I ask because, with a file-per-page scheme, you might find yourself exhausting available file handles.
Are the database/table distinctions necessary? What does this separation gain you? Truly asking the question, not being Socratic.
I would define page size in terms of multiples of your maximum key and value sizes so that you can get good read/write alignment and not have too much fragmentation. It might be worth having a naive, but space inefficient, implementation that aligns all writes.
I would recommend starting with the simplest possible design (load all indices, align writes, flat schema) to launch your database, then layer on complexity and optimizations as they become needed, but not a moment before. Hope this helps!
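On the page-size point, a small helper like this (purely illustrative, with an assumed records-per-page count and a 4 KB filesystem block) shows one way to derive a page size from the maximum key and value sizes so that writes stay aligned:

```python
def pick_page_size(max_key_bytes: int, max_value_bytes: int,
                   records_per_page: int = 16, fs_block: int = 4096) -> int:
    """Choose a page size that fits a whole number of worst-case records
    and is a multiple of the filesystem block size for aligned I/O."""
    max_record = max_key_bytes + max_value_bytes
    raw = max_record * records_per_page
    blocks = -(-raw // fs_block)      # ceiling division to the next whole block
    return blocks * fs_block


print(pick_page_size(64, 256))        # 8192: 16 records of up to 320 bytes, block-aligned
```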
For simplicity, let's suppose we have some non-leaf page A where the key is an int.
We want to find key 4812 and at this point we have entries 2311 and 5974.
So, current thread acquires a shared latch for that page and calculates that it needs leaf page B (for data between 2311 and 5974).
At the same time, some other thread is inserting on page B, having previously acquired an exclusive latch on it.
Because of the insert, it has to split the page at entry 3742 and create a new Page C with the upper half of the data.
First thread has finished reading and releases the latch on Page A.
If it then tries to find key 4812 on Page B (after the exclusive latch is released), it won't find it, because the key was moved to Page C during the page split.
If I understand correctly, a latch is implemented with a spinlock and should be short-lived.
In order to prevent this kind of problem, the writer thread would have to keep latches on all traversed non-leaf pages, which would be extremely inefficient.
I have basically 2 questions:
Is the latch at the page level only, or can it be at the row level as well? I couldn't find information about that. If row-level latches existed, the impact wouldn't be that big, but it would still be wasteful when there are no page splits (and that's mostly the case).
Is there some other mechanism to cover this?
My question is about SQL Server because I'm familiar with its internals, but this should apply to most other databases.
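For reference, the interleaving in the question can be restated as a tiny Python model; this is just the scenario above spelled out sequentially, not how SQL Server actually implements latching:

```python
# Toy model of the interleaving described above (laid out sequentially for clarity;
# the "latch" steps are only comments).

page_a = {"B": (2311, 5974)}                 # non-leaf: Page B covers keys in [2311, 5974)
page_b = {3100: "x", 3742: "y", 4812: "z"}   # leaf page B
page_c = {}                                  # will be created by the split

# Reader: S-latch on A, picks the child that should contain 4812, releases the latch on A.
target = next(child for child, (lo, hi) in page_a.items() if lo <= 4812 < hi)

# Writer: X-latch on B, the insert forces a split at 3742; the upper half moves to C.
for key in [k for k in list(page_b) if k > 3742]:
    page_c[key] = page_b.pop(key)

# Reader resumes on the page it chose earlier (B) once the X-latch is gone...
print(target, 4812 in page_b, 4812 in page_c)   # B False True - the key moved during the split
```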
I've read in a book that
Flink maintains one state instance per key value and partitions all records with the same key to the
operator task that maintains the state for this key.
My question is:
Let's say I have 4 tasks with 2 slots each,
and there's a key that belongs to 95% of the data.
Does that mean 95% of the data is routed to the same machine?
Yes, it does mean that. If you have a hot key, then partitioning by key doesn't scale well.
In some cases, there are ways to work around this limitation. For example, if you are computing analytics (e.g., you want to count page views per page per minute, and one page gets 95% of the page views), you can do pre-aggregation -- split the work for the hot key across several parallel instances, and then do one final, non-parallel reduction of the partial results. (This is just standard map/reduce logic.)
This is called "data skew" and it is the bane of scalable applications everywhere.
It's also possible that the entire (100%) load goes to the same machine. There's no guarantee that the data is spread as evenly as possible by key, only that each key gets processed on a single machine. Technically, each key gets mapped to a key group (the number of key groups is the max parallelism for the topology) and each key group gets handled by a specific instance of an operator.
One way to handle this situation involves adding a second field to the key, resulting in a greater number of possible keys and possibly reducing the data skew across the keys. Then aggregate the results in a subsequent operator using just the one original key.
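The "second field in the key" idea is often called key salting. Here is a minimal plain-Python sketch of the two-stage aggregation (no Flink API; the key names and bucket count are made up):

```python
import random
from collections import defaultdict

SALT_BUCKETS = 8   # number of parallel instances the hot key is spread across

# 95 events for one hot page, 5 spread across other pages
events = [("hot_page", 1)] * 95 + [(f"page_{i}", 1) for i in range(5)]

# Stage 1: partition by (key, salt) so the hot key is split across several tasks,
# and pre-aggregate within each (key, salt) partition.
partials = defaultdict(int)
for key, count in events:
    salt = random.randrange(SALT_BUCKETS)
    partials[(key, salt)] += count

# Stage 2: re-key by the original key only and do the final, much smaller reduction.
totals = defaultdict(int)
for (key, _salt), count in partials.items():
    totals[key] += count

print(totals["hot_page"])   # 95
```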
I was wondering why I would need to use lock mode page on a table.
Recently I came across a pretty good case of why not. While I was trying to insert a row into a table, I got a deadlock. After lots of investigation I figured out that the lock level of my table was Page, and this was the actual reason that led to the deadlock.
My guess is that this is a common scenario in large-scale, high-performance environments with multiple applications hitting the same DB.
The only thing I found is that I should use page locking if I am processing rows in the same order as the paging occurs. This looks like a weak condition that can seldom be met (especially with scaling, which could render this case obsolete).
I can see why one would lock a full table or use per row locking but the Page locking does not make much sense. Or does it?
You never need to use LOCK MODE PAGE on a table, but you may choose to do so.
It does no damage whatsoever if only a single row fits on a page (or a single row requires more than one page).
If you can fit multiple rows on a page, though, you have a meaningful choice between LOCK MODE PAGE and LOCK MODE ROW. Clearly, if you use LOCK MODE ROW, then the fact that one process has a lock on one row of a page won't prevent another process from gaining a lock on a different row on the same page, whereas LOCK MODE PAGE will prevent that.
The advantage of LOCK MODE PAGE is that it requires fewer locks when a single process updates multiple rows on a page in a single transaction.
So, you have to do a balancing act. You can take the view that there are so many rows in the database that the chances of two processes needing to lock different rows on the same page is negligible, and use LOCK MODE PAGE knowing that there's a small risk that you'll have processes blocking other processes that would not be blocked if you used LOCK MODE ROW. Alternatively, you can take the view that the risk of such blocking is unacceptable and the increased number of locks is not a problem, and decide to use LOCK MODE ROW anyway.
Historically, when the number of locks was a problem because memory was scarce (in the days when big machines had less than 100 MiB of main memory!), saving locks by using LOCK MODE PAGE made more sense than it does now, when systems have multiple gigabytes of main memory.
Note that it doesn't matter which lock mode you use if two processes want to update the same row; one will get a lock and block the other until the transaction commits (or until the statement completes if you aren't using explicit transactions).
Note that the default lock mode is still LOCK MODE PAGE, mainly in deference to history where that has always been the case. However, there is an ONCONFIG parameter, DEF_TABLE_LOCKMODE, that you can set to row (instead of page) that will set the default table lock mode to LOCK MODE ROW. You can still override that explicitly in a DDL statement, but if you don't specify an explicit lock mode, the default will be row or page depending on the setting of DEF_TABLE_LOCKMODE.
Here's the situation. Multi-million user website. Each user's page has a message section. Anyone can visit a user's page, where they can leave a message or view the last 100 messages.
Messages are short pieces of text with some extra metadata. Every message has to be stored permanently; the only thing that must be real-time quick is message updates and reads (people use it as chat). A count of messages will be read very often to check for changes. Periodically, it's OK to archive off the old messages (those beyond the most recent 100), but they must remain accessible.
Currently all in one big DB table, and contention between people reading the messages lists and sending more updates is becoming an issue.
If you had to re-architect the system, what storage mechanism / caching would you use? What kind of computer-science learning can be applied here (e.g. collections, list access, etc.)?
Some general thoughts, not particular to any specific technology:
Partition the data by user ID. The idea is that you can uniformly divide the user space to distinct partitions of roughly the same size. You can use an appropriate hashing function to divide users across partitions. Ultimately, each partition belongs on a separate machine. However, even on different tables/databases on the same machine this will eliminate some of the contention. Partitioning limits contention, and opens the door to scaling "linearly" in the future. This helps with load distribution and scale-out too.
When picking a hashing function to partition the records, look for one that minimizes the number of records that will have to be moved should partitions be added/removed.
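Consistent hashing is one common choice with that property: adding or removing a partition only moves the keys in that partition's arc of the ring. A minimal sketch (partition names and replica count are arbitrary):

```python
import bisect
import hashlib


def _h(value: str) -> int:
    """Stable hash (Python's built-in hash() is randomized per process)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring over a set of partitions."""

    def __init__(self, partitions, replicas=64):
        # Each partition gets several points on the ring to smooth the distribution.
        self._ring = sorted((_h(f"{p}:{i}"), p)
                            for p in partitions for i in range(replicas))
        self._points = [point for point, _ in self._ring]

    def partition_for(self, user_id: str) -> str:
        # The key belongs to the first ring point at or after its hash (wrapping around).
        idx = bisect.bisect(self._points, _h(user_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.partition_for("user:12345"))
```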
Like many other applications, we could assume the use of the service follows a power law curve: few of the user pages cause much of the traffic, followed by a long tail. A caching scheme can take advantage of that. The steeper the curve, the more effective caching will be. Given the short messages, if each page shows 100 messages, and each message is 100 bytes on average, you could fit about 100,000 top-pages in 1GB of RAM cache. Those cached pages could be written lazily to the database. Out of 10 Mil users, 100,000 is in the ballpark for making a difference.
Partition the web servers, possibly using the same hashing scheme. This lets you hold separate RAM caches without contention. The potential benefit is increasing the cache size as the number of users grows.
If appropriate for your environment, one approach for ensuring new messages are eventually written to the database is to place them in a persistent message queue, right after placing them in the RAM cache. The queue suffers no contention, and helps ensure messages are not lost upon machine failure.
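A sketch of that write path, with an append-only file standing in for whatever persistent queue is actually available (all names and formats here are made up):

```python
import json
import time
from collections import defaultdict, deque

RECENT = 100   # messages kept hot per user page

cache = defaultdict(lambda: deque(maxlen=RECENT))   # user_id -> last 100 messages
QUEUE_PATH = "message-queue.log"                    # stand-in for a persistent queue


def post_message(user_id: str, text: str) -> None:
    msg = {"user": user_id, "text": text, "ts": time.time()}
    cache[user_id].append(msg)                 # readers see it immediately from RAM
    with open(QUEUE_PATH, "a") as q:           # made durable before we acknowledge
        q.write(json.dumps(msg) + "\n")


def drain_queue_to_database(db_insert) -> None:
    """Run by a background worker: replays the queue into permanent storage."""
    with open(QUEUE_PATH) as q:
        for line in q:
            db_insert(json.loads(line))


post_message("user42", "hello")
drain_queue_to_database(print)   # here "the database" is just print()
```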
One simple solution could be to denormalize your data, and store pre-calculated aggregates in a separate table, e.g. a MESSAGE_COUNTS table which has a column for the user ID and a column for their message count. When the main messages table is updated, then re-calculate the aggregate.
It's just shifting the bottleneck from one place to another, but it might move it somewhere that's less of a burden.