Is there a way to find a statistics on table read and write count on SQL Server 2005/2008?
I am specifically looking for DMVs/DMFs without using triggers or audits.
The goal here is to find an appropriate fill factor for indexes - I got the idea from this article (Fill Factor Defined).
[UPDATE] There is a follow-up question on ServerFault: How to determine Read/Write intensive table from DMV/DMF statistics
The following query can be used to find the number of reads and writes on all tables in a database. The result can be exported to a CSV file, and you can then easily calculate the read/write ratio with Excel formulas (or directly in SQL, as sketched after the query). This is very useful when planning indexes on a table.
DECLARE @dbid int
SELECT @dbid = db_id('database_name')

SELECT TableName = object_name(s.object_id),
       Reads = SUM(user_seeks + user_scans + user_lookups),
       Writes = SUM(user_updates)
FROM sys.dm_db_index_usage_stats AS s
INNER JOIN sys.indexes AS i
    ON s.object_id = i.object_id
    AND i.index_id = s.index_id
WHERE objectproperty(s.object_id, 'IsUserTable') = 1
    AND s.database_id = @dbid
GROUP BY object_name(s.object_id)
ORDER BY Writes DESC
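If you prefer to skip the CSV/Excel step, here is a minimal sketch that computes the ratio directly in T-SQL. The alias names are just illustrative, and NULLIF guards against tables that have no writes:

DECLARE @dbid int
SELECT @dbid = db_id('database_name')

SELECT TableName = object_name(s.object_id),
       Reads  = SUM(s.user_seeks + s.user_scans + s.user_lookups),
       Writes = SUM(s.user_updates),
       -- NULLIF avoids a divide-by-zero error for tables that are never written to
       ReadWriteRatio = SUM(s.user_seeks + s.user_scans + s.user_lookups) * 1.0
                        / NULLIF(SUM(s.user_updates), 0)
FROM sys.dm_db_index_usage_stats AS s
WHERE objectproperty(s.object_id, 'IsUserTable') = 1
  AND s.database_id = @dbid
GROUP BY object_name(s.object_id)
ORDER BY ReadWriteRatio DESC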
sys.dm_db_index_physical_stats (size and fragmentation)
sys.dm_db_index_usage_stats (usage, number of scans/seeks/updates etc)
sys.dm_db_index_operational_stats (current activity on index)
Remember 'table' means the clustered index or the 'heap'.
To determine an appropriate fill factor for a table's indexes, you need to look at the number of page splits occurring. This is shown in sys.dm_db_index_operational_stats (see the example query after the list below):
Leaf allocation count: Total number of page splits at the leaf level of the index.
Nonleaf allocation count: Total number of page splits above the leaf level of the index.
Leaf page merge count: Total number of page merges at the leaf level of the index.
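A minimal sketch of reading those counters for the user tables in the current database (the join to sys.indexes, which only adds the index name, is optional):

SELECT object_name(ios.object_id) AS TableName,
       i.name AS IndexName,
       ios.leaf_allocation_count,      -- page splits at the leaf level
       ios.nonleaf_allocation_count,   -- page splits above the leaf level
       ios.leaf_page_merge_count       -- page merges at the leaf level
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
INNER JOIN sys.indexes AS i
    ON ios.object_id = i.object_id
    AND ios.index_id = i.index_id
WHERE objectproperty(ios.object_id, 'IsUserTable') = 1
ORDER BY ios.leaf_allocation_count DESC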
After doing a bit of digging, I've seen a few posts saying that the page split numbers from the DMVs are not that useful (I haven't personally confirmed this), but there is also a performance counter "Page Splits/sec" (though it is only available at the SQL Server instance level).
I use the rule of thumb that ordinary tables get the default 90% fill factor and high-insert tables somewhere between 70 and 85% (depending on row size). Read-only tables can use a fill factor of 100%.
If you have a good clustered index (i.e., ever increasing, unique, narrow) then the real determining issues for Fill Factor are how the table is updated and the data types of the columns.
If the columns are all fixed-size (e.g., integer, decimal, float, char) and non-nullable, then an update cannot increase the storage required for a row. Given the good clustered index, you should pick a fill factor of 90+, even 100, since page splits won't happen.
If you have a few variable length columns (e.g. a Varchar to hold User Name) and the columns are seldom updated after insert then you can still keep a relatively high Fill Factor.
If you have data that is highly variable in length (e.g., UNC paths, Comment fields, XML) then the Fill Factor should be reduced. Particularly if the columns are updated frequently and grow (like comment columns).
Non-clustered indexes are generally the same, except the index key may be more problematic (non-unique, perhaps not ever-increasing).
I think sys.dm_db_index_physical_stats gives the best metrics for this but it is after the fact. Look at the avg/min/max record size, avg frag size, avg page space used to get a picture of how the index space is being used.
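A small sketch of acting on those metrics; the table name, index name, and the 85% fill factor below are only examples:

-- inspect how full the leaf pages are and how large the records get
SELECT index_level, avg_page_space_used_in_percent,
       min_record_size_in_bytes, max_record_size_in_bytes, avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'DETAILED')

-- then rebuild the index with the fill factor you settled on
ALTER INDEX IX_MyTable_MyColumn ON dbo.MyTable REBUILD WITH (FILLFACTOR = 85)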
HTH.
I have a table with 3 fields (username, target_value, score) generated externally by a full cross of usernames (~400,000) and target_values (~4,000) and a calculated score, leading to a total row count of ~1.6 billion.
All my queries on this table will be of the form
SELECT *
FROM _table
WHERE target_values IN (123, 456)
My initial version of this included a B-tree index on target_values, but I ended up spending 45 minutes on a Bitmap Heap Scan of the index.
I've also been looking at BRIN indexes, partitioning, and table clustering, but since it takes hours to apply each approach to the table, I can't exactly brute-force each option and test for performance.
What are some recommendations for dealing with a single massive table with very 'blocky' data in Postgres 10?
If the table is a cross join of two data sets, why don't you store the individual tables and calculate the join as you need it? Databases are good at that.
From your description I would expect a performance gain if you ran CLUSTER on the table to physically rewrite it in index order. Then you would have to access way fewer table blocks.
Unfortunately, CLUSTER takes a long time, makes the table unavailable while it runs, and has to be repeated regularly.
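A minimal sketch of that approach, assuming a column named target_value and an illustrative index name:

CREATE INDEX target_value_idx ON _table (target_value);
-- physically rewrite the table in index order; the table is locked for the duration
CLUSTER _table USING target_value_idx;
-- refresh planner statistics afterwards
ANALYZE _table;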
An alternative that may be better is to partition the table by target_value. 4000 partitions are a bit much, so maybe use list partitioning to bundle several together into one partition.
This will allow your queries to perform fast sequential scans on only a few partitions. It will also make autovacuum's job easier.
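A rough sketch of the declarative list partitioning available in Postgres 10. The parent table scores stands in for _table, the column types are guesses, and the bundles of target_value per partition are arbitrary:

CREATE TABLE scores (
    username     text,
    target_value integer,
    score        double precision
) PARTITION BY LIST (target_value);

-- bundle several target_values per partition to keep the partition count manageable
CREATE TABLE scores_p01 PARTITION OF scores FOR VALUES IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
CREATE TABLE scores_p02 PARTITION OF scores FOR VALUES IN (11, 12, 13, 14, 15, 16, 17, 18, 19, 20);
-- ... and so on for the remaining bundles

A query such as SELECT * FROM scores WHERE target_value IN (3, 17) then only needs to touch the partitions that can contain those values.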
The bottom line, however, is that if you select a lot of rows from a table, it will always take a long time.
Does the number of records in a DB affect the speed of SELECT queries?
I mean, if one DB has 50 records and another has 5 million records, will SELECTs from the second one be slower, assuming I have all the indexes in the right place?
Yes, but it doesn't have to be a large penalty.
At the most basic level an index is a b-tree. Performance is somewhat correlated to the number of levels in the b-tree, so a 5-record database has about 2 levels and a 5-million-record database has about 22 levels. But it's binary, so a 10-million-row database has 23 levels, and really, index access times are typically not the problem in performance tuning - the usual problem is tables that aren't indexed properly.
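For illustration, the level counts above come from a base-2 logarithm (the binary-split simplification used in this answer):

$$\text{levels} \approx \log_2 N, \qquad \log_2(5{,}000{,}000) \approx 22.3, \qquad \log_2(10{,}000{,}000) \approx 23.3$$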
As noted by odedsh, caching is also a large contributor, and small databases will be cached well. Sqlite stores records in primary key sequence, so picking a primary key that allows records that are commonly used together to be stored together can be a big benefit.
Yeah it matters for the reasons the others said.
There are other things that can affect the speed of SELECT statements too, such as how many columns you're grabbing data from.
I once did some speed tests on a table with over 150 columns, where I needed to grab only about 40 of the columns, and I needed all 20,000+ records. While the speed differences were very minimal (we're talking 20 to 40 milliseconds), it was actually faster to grab the data from all the columns with a 'SELECT ALL *' rather than going 'SELECT ALL Field1, Field2, etc.'.
I assume the more records and columns in your table, the greater the speed difference this example will net you, but I never had a need to test it any further in more extreme cases like 5 million records in a table.
Yes.
If a table is tiny and the entire DB is tiny, then when you select anything from the table it is very likely that all the data is already in memory and the result can be returned immediately.
If the table is huge but you have an index and you are doing a simple select on the indexed columns, then the index can be scanned, the correct blocks can be read from disk, and the result returned.
If there is no index that can be used, then the DB will do a full table scan, reading the table block by block looking for matches.
If there is only a partial overlap between the index columns and the query's columns, then the DB can try to minimize the number of blocks that need to be read. A lot of thought can go into properly choosing the index structure and type (BITMAP / REGULAR).
And this is just for the most basic SQL that selects from a single table without any calculations.
I have another question, but I'll be more specific.
I see that selecting from a million-row table takes < 1 second. What I don't understand is how it can do this with indexes. It seems to take 10 ms to do a seek, so to finish in 1 second it must do fewer than 100 seeks. If there is an index entry per row, then 1M rows take at least 1K blocks to store the index (actually more if it's 8 bytes per row: a 32-bit index value plus a 32-bit key offset). Then we would need to actually travel to the rows and collect the data. How do databases keep the number of seeks low and pull the data as fast as they do?
One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read in a range of values along the indexed field, you find the first one, and you can just read it all in at once with no extra IO.
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records that the SQL server statistically expects to retrieve is so high that the random access required to go from the index to the underlying rows would need so many IO operations that a table scan would be faster, then it will do a table scan instead. You can see this e.g. using the query planner in SQL Server or PostgreSQL (see the sketch below). But for small ranges the index is usually better, and the query plan will reflect this.
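As an illustration (PostgreSQL syntax, with a hypothetical table and index), EXPLAIN shows the planner switching between the two strategies:

-- hypothetical table and index, just to show the planner's choice
CREATE TABLE events (id bigint PRIMARY KEY, created_at timestamptz, payload text);
CREATE INDEX events_created_at_idx ON events (created_at);

-- a narrow range: the planner typically picks an index scan
EXPLAIN SELECT * FROM events WHERE created_at >= now() - interval '1 hour';

-- a predicate matching most of the table: the planner typically falls back to a sequential scan
EXPLAIN SELECT * FROM events WHERE created_at >= now() - interval '10 years';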
I have two tables:
urls (table with indexed pages, host is an indexed column, 30 million rows)
hosts (table with information about hosts, host is an indexed column, 1 million rows)
One of the most frequent SELECT in my application is:
SELECT urls.* FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?
In projects that have more than 100,000 rows in the urls table, the query executes very slowly.
As the tables have grown, the query has been executing slower and slower. I've read a lot about NoSQL databases (like MongoDB), which are designed to handle such big tables, but changing my database from PgSQL to MongoDB is a big issue for me. Right now I would like to try to optimize the PgSQL solution. Do you have any advice? What should I do?
This query should be fast in combination with the provided indexes:
CREATE INDEX hosts_host_idx ON hosts (host)
WHERE is_spam IS NULL;
CREATE INDEX urls_projects_id_idx ON urls (projects_id, id DESC);
SELECT *
FROM urls u
WHERE u.projects_id = ?
AND EXISTS (
    SELECT 1
    FROM hosts h
    WHERE h.host = u.host
    AND h.is_spam IS NULL
)
ORDER BY u.id DESC
LIMIT ?;
The indexes are the more important ingredient. The JOIN syntax as you have it may be just as fast. Note that the first index is a partial index and the second is a multicolumn index with DESC order on the second column.
Much depends on the specifics of your data distribution; you will have to test (as always) with EXPLAIN ANALYZE to find out about performance and whether the indexes are used.
General advice about performance optimization applies, too. You know the drill.
Add an index on the hosts.host column (primarily in the hosts table, this matters) and a composite index on urls.projects_id, urls.id, run the ANALYZE statement to update all statistics, and observe subsecond performance regardless of spam percentage.
Slightly different advice would apply if almost everything is always spam and if the "projects", whatever they are, are few in number and each very big.
Explanation: updating the statistics makes it possible for the optimizer to recognize that the urls and hosts tables are both quite big (well, you didn't show us the schema, so we don't know your row sizes). The composite index starting with projects_id will hopefully¹ rule out most of the urls content, and its second component will immediately feed the rest of urls in the desired order, so it is quite likely that an index scan of urls will be the basis for the query plan chosen by the planner. It is then essential to have an index on hosts.host to make the hosts lookups efficient; the majority of this big table will never be accessed at all.
¹) Here is where we assume that projects_id is reasonably selective (that it is not the same value throughout the whole table).
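A sketch of the statements this answer describes; the index names are placeholders, and the composite index mirrors the (projects_id, id) ordering mentioned above:

CREATE INDEX idx_hosts_host ON hosts (host);
CREATE INDEX idx_urls_projects_id_id ON urls (projects_id, id);
ANALYZE hosts;
ANALYZE urls;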
Since database data is organized in 8 KB pages in a B-tree, and likewise for PK information, it should be possible for each table in the database to calculate the height of the B-tree, thus revealing how many jumps it takes to reach certain data.
Since both row size and PK size are of great importance, this is difficult to calculate, since e.g. a varchar(250) need not take up 250 bytes.
1) Is there a way to get the info out of SQL Server?
2) if not, is it possible to give a rough estimate using some code analyzing the tables of the db?
YES! Of course!
Check out the DMVs (dynamic management views) in SQL Server - they contain a treasure trove of information about your indices. sys.dm_db_index_physical_stats is particularly useful for looking at index properties...
If you run this query in AdventureWorks against the largest table - Sales.SalesOrderDetails with over 200'000 rows - you'll get some data:
SELECT
index_depth,
index_level,
record_count,
avg_page_space_used_in_percent,
min_record_size_in_bytes,
max_record_size_in_bytes,
avg_record_size_in_bytes
FROM
sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Sales.SalesOrderDetail'), 1, NULL, 'DETAILED')
You'll get output for all index levels - so you'll see at a glance how many levels there are in the index (I have three rows -> three levels in the index). Index level 0 is always the leaf level - where in the clustered index (index_id = 1) you have your actual data pages.
You can see the average, minimum, and maximum record sizes in bytes and a great deal of additional info - read up on DMVs; they're a great way to diagnose and peek into the inner workings of SQL Server!
try this:
SELECT INDEXPROPERTY(OBJECT_ID('table_name'), 'index_name', 'IndexDepth')
From How many elements can be held in a B-tree of order n?:
(average number of elements that fit in one leaf node) * n ^ (depth - 1)
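As a purely illustrative example, with an order of n = 100, a depth of 3, and about 100 elements fitting in one leaf node, the estimate works out to:

$$100 \times 100^{3-1} = 100 \times 10{,}000 = 1{,}000{,}000 \text{ elements}$$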