Assign proper fill factor option for each index

I use SQL Server and want to assign an appropriate fill factor to each index. I know the following values for each index:
Row count of the table
Number of scans on the index
Number of seeks on the index
Number of lookups on the index
Number of updates on the index
I know that scans, seeks, and lookups argue for a fill factor close to 100, while updates argue for a lower one, but I am looking for a formula that calculates the proper fill factor from the values above.
EDIT
I use the script below to get these values:
select SCHEMA_NAME(B.schema_id)+'.'+B.name+' \ '+C.name AS IndexName,
A.user_scans,
A.user_seeks,
A.user_lookups,
A.user_updates,
D.rowcnt,
C.fill_factor
from sys.dm_db_index_usage_stats A
INNER JOIN sys.objects B ON A.object_id = B.object_id
INNER JOIN sys.indexes C ON C.object_id = B.object_id AND C.index_id = A.index_id
INNER JOIN sys.sysindexes D ON D.id = B.object_id AND D.indid = A.index_id
Edit 2
I used the references below on choosing the best fill factor value:
Best value for fill factor 1
Best value for fill factor 2

I would use the technique described by Kendra Little from Brent Ozar Unlimited.
Here is the article. She describes her methodology for finding and addressing fill factor issues.
Also, as Remus mentioned in his comments, you should use discretion when messing with the fill factor. I realize a lot of articles on the internet make it sound as though a high fill factor will cause innumerable page splits and ruin your performance, but lowering the fill factor can cause more problems than it solves.
Kendra suggests using the default fill factor and tracking fragmentation over time, and only when an index appears to have a fragmentation issue due to page splits, should you slowly decrease the fill factor. I've been using this technique and I've noticed a much better use of my cache because of how much less my indexes are needlessly inflated.
"I frequently find that people have put a fillfactor setting of 80 or below on all the indexes in a database. This can waste many GB of space on disk and in memory. This wasted space causes extra trips to storage, and the whole thing drags down the performance of your queries."
Check out this quote in Books Online: "For example, a fill factor value of 50 can cause database read performance to decrease by two times."
So, to put it nicely: I'm not sure that you should start needlessly messing with the fill factor. Observe, study, then act.
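To make the "two times" figure concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model in which a completely full leaf page holds a fixed number of rows (`rows_per_full_page` is a made-up parameter, and page headers and variable row sizes are ignored), so treat the numbers as illustrative only.

```python
import math

def pages_needed(row_count, rows_per_full_page, fill_factor):
    """Leaf pages needed right after a rebuild at the given fill factor (1-100).

    Simplified model: rows_per_full_page is a hypothetical constant, not
    something SQL Server reports directly.
    """
    if not 1 <= fill_factor <= 100:
        raise ValueError("fill_factor must be between 1 and 100")
    # A page filled to fill_factor percent holds proportionally fewer rows.
    rows_per_page = max(1, math.floor(rows_per_full_page * fill_factor / 100))
    return math.ceil(row_count / rows_per_page)

full = pages_needed(1_000_000, 100, 100)
eighty = pages_needed(1_000_000, 100, 80)
half = pages_needed(1_000_000, 100, 50)
print(full, eighty, half)
```

Under this model a scan at fill factor 50 touches twice as many pages as at 100, and a blanket setting of 80 inflates every index by a quarter, which is the waste the quote above describes.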

Related

Database merge join cost evaluation problem

I have a question about evaluating the minimum page I/O cost of the query $\pi_{A,B,C,D}(R \bowtie_{A=C} S)$ using the merge join method. I need to evaluate the following:
Page I/O cost to sort R.
Page I/O cost to sort S.
Page I/O cost to join R and S.
My question is: since the query projects onto only a subset of the attributes (A, B, C, D), is it possible to eliminate the unwanted attributes while sorting R and S separately (given that A and B are in R, and C and D are in S)? If so, the formula $2 b_r \left(\lceil \log_{M-1}(b_r / M) \rceil + 1\right)$ does not seem to apply directly.
Or, more precisely, at what point is it best to eliminate the unwanted attributes?
This question has had me stuck for a long time. I hope to get some insight on it.
Thanks.
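For anyone following along, the sort-cost formula quoted above can be evaluated with a few lines of Python. This is only a sketch: `external_sort_cost`, `b`, and `M` are names chosen here, the page counts are invented, and the merge passes are counted with integer arithmetic instead of a floating-point logarithm to avoid rounding surprises.

```python
def external_sort_cost(b, M):
    """Page I/Os to sort b pages with M buffer pages.

    Evaluates 2 * b * (ceil(log_{M-1}(b / M)) + 1), i.e. every pass reads and
    writes all b pages, including the final write of the sorted result.
    """
    if b <= 0 or M < 3:
        raise ValueError("need b > 0 and at least 3 buffer pages")
    runs = (b + M - 1) // M              # sorted runs produced by the first pass
    merge_passes = 0
    while runs > 1:                      # each merge pass combines M-1 runs
        runs = (runs + (M - 1) - 1) // (M - 1)
        merge_passes += 1
    return 2 * b * (merge_passes + 1)

# Hypothetical sizes: 10,000 pages unprojected, 4,000 pages if only the
# needed attributes were kept, with 101 buffer pages available.
print(external_sort_cost(10_000, 101))
print(external_sort_cost(4_000, 101))
```

Plugging in a smaller page count shows how much the formula's result depends on tuple width, which is why the timing of the projection matters for the question. Note the sketch does not model the extra detail that the first pass would still read the original, unprojected pages.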

Why is hash index slower using "Less than" in SQL

I've finished my first semester in a college-level SQL course where we used "SQL queries for Mere Mortals" 3rd edition.
Long term I want to work in data governance or as a data scientist, so digging deeper is needed and I found the Stanford SQL course. Today taking the first mini quiz, I got the answers right but on these two I'm not understanding WHY I got the answers right.
My 'SQL for Mere Mortals' book doesn't even cover hash or tree-based indexes so I've been searching online for them.
I mostly guessed based on what she said but it feels more like luck than "I solidly understand why". So I've ordered "Introduction to Algorithms" 3rd edition by Thomas Cormen and it arrived last week but it will take me a while to read through all 1,229 pages.
Found that book in this other stackoverflow link =>https://stackoverflow.com/questions/66515417/why-is-hash-function-fast
Stanford Course => https://www.edx.org/course/databases-5-sql
I thought a hash index on College.enrollment would not speed up the query because the condition is "less than a number" rather than equality with a specific number. Based on the linked question 'Better to use "less than equal" or "in" in sql query', I'm guessing the query would be faster if we used "<=" rather than "<"?
This one was just a process of elimination as it mentions the first item after the WHERE clause, but then was confusing as it mentions the last part of Apply.cName = College.cName.
My questions:
Algebra has technical terms such as numerator, denominator, and quotient that describe specific parts of an equation. What technical terms would you use to describe why these answers are correct?
On the second question, why do the answers reference the first part of the second line and the last part of the same line? Why didn't they pick the first part of each or the last part of each?
For context, most of my SQL queries are now written for PostgreSQL within PyCharm in Python, but I do a lot of practice using the pgAdmin 4 or MySQL Workbench desktop platforms.
I welcome any recommendations you have for paper books or PDFs that have step-by-step tutorials, as many websites have holes or reference technical details that are confusing.
Thanks

1. A hash index is only useful for equality matches, whereas a tree index can be used for inequality (< or >= etc).
With this in mind, College.enrollment < 5000 cannot use a hash index, as it is an inequality. All other options are exact equality matches.
This is why most RDBMSs only let you create tree-based indexes.
2. This one is pretty much up in the air.
"the first item after the WHERE clause" is not relevant. Most RDBMSs will reorder the joins and filters as they see fit in order to match indexes and table statistics.
I note that the query as given is poorly written. It should use proper JOIN syntax, which is much clearer, and has been in use for 30 years already.
SELECT * -- you should really specify exact columns
FROM Student AS s -- use aliases
JOIN [Apply] AS a ON a.sID = s.sID -- Apply is a reserved keyword in many RDBMS
JOIN College AS c ON c.cName = a.cName
WHERE s.GPA > 1.5 AND c.cName < 'Cornell';
Now it's hard to say what the query optimizer would do here. A lot depends on the cardinalities (size of tables) in absolute terms and relative to each other, as well as the data skew in s.GPA and c.cName.
It also depends on whether secondary key (or indeed INCLUDE) columns are added, which is clearly not being considered here.
Given the options for indexes you have above, and no other indexes (not realistic obviously), we could guesstimate:
Student.sID, College.cName
This may result in an efficient backwards scan on College starting from 'Cornell', but Apply would need to be joined with a hash or a naive nested loop (scanning the index each time).
The index on Student would mean an efficient nested loop with an index seek.
Student.sID, Student.GPA
Is this one index or two? If it's two separate indexes, the second will be used, and the first is obviously going to be useless. Apply and College will still need heavy joins.
Apply.cName, College.cName
This would probably get you a merge-join on those two columns, but Student would need a big join.
Apply.sID, Student.GPA
Student could be efficiently scanned from 1.5, and Apply could be seeked, but College requires a big join.
Of these options, the first or the last is probably better, but it's very hard to say without further info.
In a real system, I would have indexes on all tables, and use INCLUDE columns wisely in order to avoid key-lookups. You would want to try to get a better feel for which tables are the ones that need to be filtered early etc.

First question
A hash-index is not linearly-searchable (see Slide 7), that is, you cannot perform range-comparisons with a hash-index. This is because (in general terms) hash functions are one-way: given the output of a hash function you cannot determine the input, and the output will be in apparently random order (having a random order is good for ensuring an even load over the set of hashtable bins).
Now, for a contrived and oversimplified example:
Supposing you have these rows:
PK | Enrollment
----------------
1 | 1
2 | 10
3 | 100
4 | 1000
5 | 10000
A perfect hash index of this table would look something like this:
Assuming that the hash of 1 is 0xF822AA896F34253E and the hash of 10 is 0xB383A8BBDAA41F98, and so on...
EnrollmentHash | PhysicalRowPointer
---------------------------------------
0xF822AA896F34253E | 1
0xB383A8BBDAA41F98 | 2
0xA60DCD4E78869C9C | 3
0x49B0AF769E6B1EB3 | 4
0x724FD1728666B90B | 5
So given this hashtable index, looking at the hashes you cannot determine which hash represents larger enrollment values vs. smaller values. But a hashtable index does give you O(1) lookup for single specific values, which is why it works best for discrete, non-continuous, data values, especially columns used in JOIN criteria.
A tree-based index, by contrast, does preserve relative ordering information about values, but with O(log n) lookup time.
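The same contrast can be reproduced in a few lines of Python using the sample rows above. This is only an analogy: a `dict` stands in for the hash index and a sorted list searched with `bisect` stands in for the tree index; real database indexes are considerably more involved.

```python
import bisect

rows = [(1, 1), (2, 10), (3, 100), (4, 1000), (5, 10000)]  # (PK, Enrollment)

# Hash-style index: enrollment -> list of PKs (a dict is a hash table).
hash_index = {}
for pk, enrollment in rows:
    hash_index.setdefault(enrollment, []).append(pk)

# Tree-style index: entries kept sorted by enrollment.
tree_index = sorted((enrollment, pk) for pk, enrollment in rows)

def equals(value):
    # One hash lookup, independent of the number of rows.
    return hash_index.get(value, [])

def less_than_hash(limit):
    # The hash table has no ordering, so every key must be inspected.
    return sorted(pk for enrollment, pks in hash_index.items()
                  if enrollment < limit for pk in pks)

def less_than_tree(limit):
    # Binary search for the first entry >= limit, then take everything before it.
    cut = bisect.bisect_left(tree_index, (limit,))
    return [pk for _, pk in tree_index[:cut]]

print(equals(100), less_than_hash(5000), less_than_tree(5000))
```

Both range functions return the same rows, but `less_than_hash` has to visit every key while `less_than_tree` finds the cut-off point with a binary search, which mirrors why Enrollment < 5000 gets no help from a hash index.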
Second question
First, I need to rewrite the query to use modern JOIN syntax. The old style (using commas) has been obsolete since SQL-92 in 1992, which is almost 30 years ago.
SELECT
*
FROM
Apply
INNER JOIN Student ON Student.sID = Apply.sID
INNER JOIN College ON College.cName = Apply.cName
WHERE
Student.GPA > 1.5
AND
College.cName < 'Cornell'
Now, generally speaking the best way to answer this kind of question would be to know what the STATISTICS (cardinality, value distribution, etc) of the tables are. But without that I can still make some guesses.
I assume that College is the smallest table (~500 rows?), Student will have maybe 1-2m rows, and assuming every Student makes 4-5 applications then the Apply table will have ~5m rows.
...armed with that inference, we can deduce:
Student.sID = Apply.sID is an ID match - so a hash-index would be better in most cases (excepting if the PK clustering matters, but I won't digress).
Student.GPA > 1.5 - this is a range search so having a tree-based index here helps.
College.cName < 'Cornell' - again, this is a range comparison so a tree-based index here helps too.
So the best indexes would be Student.GPA and College.cName, but that isn't an option - so let's see what the benefits of each option are...
(As I was writing this, I saw that @charlieface posted their answer which already covers this, so I'll just link to theirs to save my time: https://stackoverflow.com/a/67829326/159145 )

CASE statement versus temporary table

I have my choice between two different techniques in converting codes to text:
insert into #TMP_CONVERT(code, value)
values
(1, 'Uno')
,(2, 'Dos')
,(3, 'Tres')
;
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN #TMP_CONVERT tc
on tc.code = x.code
Or
case x.code
when 1 then 'Uno'
when 2 then 'Dos'
when 3 then 'Tres'
else 'Unknown'
end as THE_VALUE
The table has about 20 million rows.
The typical size of the code lookup table is 10 rows.
I would rather use #1, but I don't like left outer joins.
My questions are:
Is one faster than the other in any really meaningful way?
Does any SQL engine optimize this out anyway? That is, does it just read the lookup table into memory and essentially perform the CASE logic anyway?
I happen to be using T-SQL, but I would like to know for other RDBMSs as well because I use several.
[Edit to clarify not liking LEFT OUTER JOIN]
I use LEFT OUTER JOINs when I need them, but whenever I use them I double-check my logic and data to confirm I actually need them. Then I add a comment to the code that indicates why I am using a LEFT OUTER JOIN. Of course I have to do a similar exercise when I use INNER JOIN; that is, make sure I am not dropping data.

There is some overhead to using a join. However, if the code is made a primary clustered key, then the performance for the two might be comparable -- with the join even winning out. Without an index, I would expect the case to be slightly better than the left join.
These are just guesses. As with all performance questions, though, you should check on your data on your systems.
I also want to react to your not liking left joins. These provide important functionality for SQL and are the right way to address this problem.

The code executed for the join will likely be substantially more than the code executed for the hardcoded case options.
The execution plan will have an extra join iterator along with an extra scan or seek operator (depending on the availability of a suitable index). On the positive side, the 10 rows will likely all fit on a single page in #TMP_CONVERT, and that page will be in memory anyway. Being a temp table, it also won't bother taking and releasing row locks each time. Still, the code to latch the page, locate the correct row, and crack the desired column value out of it over 20,000,000 iterations would likely add a measurable amount of CPU time compared with looking up a value in a hardcoded list. (You could potentially try nested CASE expressions too, to perform a binary search and avoid the need for 10 branches.)
But even if there is a measurable time difference it still may not be particularly significant as a proportion of the query time as a whole. Test it. Let us know what you find...
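If you want a quick harness for that kind of test, something like the following Python sketch may help. It uses SQLite in memory purely as a stand-in (it is not T-SQL, so the timings say nothing about SQL Server), and the table `x`, its 100,000 rows, and the code distribution are all made up. What it does show is that the two formulations return identical results, and how to time them side by side.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER PRIMARY KEY, code INTEGER)")
# 100,000 rows with codes 0..4; codes 0 and 4 have no translation.
conn.executemany("INSERT INTO x (code) VALUES (?)",
                 ((i % 5,) for i in range(100_000)))

CASE_SQL = """
SELECT id,
       CASE code WHEN 1 THEN 'Uno' WHEN 2 THEN 'Dos' WHEN 3 THEN 'Tres'
                 ELSE 'Unknown' END AS the_value
FROM x ORDER BY id
"""

JOIN_SQL = """
WITH tmp_convert(code, value) AS (VALUES (1, 'Uno'), (2, 'Dos'), (3, 'Tres'))
SELECT x.id, COALESCE(tc.value, 'Unknown') AS the_value
FROM x LEFT OUTER JOIN tmp_convert AS tc ON tc.code = x.code
ORDER BY x.id
"""

def timed(sql):
    """Run the query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    result = conn.execute(sql).fetchall()
    return result, time.perf_counter() - start

case_rows, case_seconds = timed(CASE_SQL)
join_rows, join_seconds = timed(JOIN_SQL)
print(f"CASE: {case_seconds:.3f}s  JOIN: {join_seconds:.3f}s  "
      f"same result: {case_rows == join_rows}")
```

On a real system you would run each variant several times against the actual 20-million-row table and compare CPU time, not a single wall-clock measurement.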

You can also avoid creating a temporary table in this case by using a WITH clause (a common table expression). Your query might look something like this:
WITH TMP_CONVERT(code,value) AS -- Semicolon can be required before WITH.
(
SELECT * FROM (VALUES (1,'Uno'),
(2,'Dos'),
(3,'Tres')
) tbl(code,value)
)
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN TMP_CONVERT tc
on tc.code = x.code
Or even a subquery can be used:
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN (VALUES (1,'Uno'),
(2,'Dos'),
(3,'Tres')
) tc(code,value)
ON tc.code = x.code
Hope this can be helpful.

General Big-Data principles for finding pairs of similar objects - "fuzzy inner join"

Firstly, sorry for the vague title and if this question has been asked before, but I was not entirely sure how to phrase it.
I am looking for general design principles for finding pairs of 'similar' objects from two different data sources.
Let's say, for simplicity, that we have two databases, A and B, both containing large volumes of objects, each with a timestamp and geolocation, along with some other data that we don't care about here.
Now I want to perform a search along these lines:
Within a certain time frame and location given as search terms, find pairs of objects from A and B respectively, ordered by some similarity score, for example a scalar 'time/space distance' function, distance(a,b), that calculates the distance in time and space between the objects.
I am expecting to get a (potentially ginormous) set of results where the first result is a pair of data points which has the minimum 'distance'.
I realize that the full search space is cardinality(A) x cardinality(B).
Are there any general guidelines on how to do this in a reasonably efficient way? I assume that I would need to replicate the two databases into a common repository like Hadoop? But then what? I am not sure how to perform such a query in Hadoop either.
What is this type of query called?
To me, this is some kind of "fuzzy inner join" that I struggle to wrap my head around how to construct, let alone efficiently at scale.

SQL joins don't have to be based on equality. You can use ">", "<", "BETWEEN".
You can even do something like this:
select a.val aval, b.val bval, a.val - b.val diff
from A join B on abs(a.val - b.val) < 100

What you need is a way to divide your objects into buckets in advance, without comparing them (or at least making a linear, rather than square, number of comparisons). That way, at query time, you will only be comparing a small number of items.
There is no "one-size-fits-all" way to bucket your items. In your case the bucketing can be based on time, geolocation, or both. Time-based bucketing is very natural, and can also scale elastically (increase or decrease the bucket size). Geo-clustering buckets can be based on distance from a particular point in space (if the space is abstract), or on some finite division of the space (for example, if you divide the entire Earth's world map into tiles, which can also scale nicely if done right).
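As a minimal sketch of the time-based variant, assuming each object is reduced to a single numeric timestamp and that only pairs within `max_gap` of each other are of interest (both are simplifications introduced here, as are the function and variable names): with a bucket width equal to `max_gap`, any qualifying pair must sit in the same or a neighbouring bucket, so each item from A is compared against three buckets of B instead of all of B.

```python
from collections import defaultdict

def fuzzy_pairs(a_items, b_items, max_gap):
    """Return (gap, a, b) for all pairs with |a - b| <= max_gap, closest first."""
    if max_gap <= 0:
        raise ValueError("max_gap must be positive")
    # Bucket B once, up front: a linear pass, no pairwise comparisons.
    buckets = defaultdict(list)
    for b in b_items:
        buckets[int(b // max_gap)].append(b)

    pairs = []
    for a in a_items:
        key = int(a // max_gap)
        # A match can only be in the same bucket or one of its two neighbours.
        for neighbour in (key - 1, key, key + 1):
            for b in buckets.get(neighbour, ()):
                gap = abs(a - b)
                if gap <= max_gap:
                    pairs.append((gap, a, b))
    return sorted(pairs)

print(fuzzy_pairs([10, 205, 1000], [12, 190, 650, 1099], 100))
```

The same idea extends to two or three dimensions by using a tuple of grid coordinates as the bucket key and checking all adjacent cells, and the bucket key is also a natural partitioning key if the data is spread over several servers.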
A good question to ask is "if my data starts growing rapidly, can I handle it by just adding servers?" If not, you might need to rethink the design.

Avoid huge intermediate query join result (1.34218E+35 rows)

I need a query to prevent a join that produces 1.34218E+35 results!
I have a table item (approx 8k items; e.g. Shield of Foo, Weapon of Bar), and each item is one of 9 different item_type (Armor, Weapon, etc). Each item has multiple entries in item_attribute (e.g. Damage, Defense). Here is a pseudo-code representation:
Table item (
item_id autoincrement,
...
item_type_id char, --- e.g. Armor, Weapon, etc
level int --- Must be at least this level to wear this item
);
Table item_attribute (
item_id int references item(item_id),
...
attribute char, --- e.g. Damage, Defense, etc
amount int --- e.g. 100
)
Now, a character wears 9 total items at once (one each of Armor, Weapon, Shield, etc) that I call a setup. I want to build a list of setups that maximizes an attribute, but has a minimum of another attribute. In example terms: for a character level 100, present the top 10 setups by damage where sum(defense of all items) >= 100.
The naïve approach is:
select top 10
q1.item_id,q2.item_id,q3.item_id,..., q1.damage+q2.damage+q3.damage... as damage
from
(select item_id from item where item_type = 'Armor'
and level <= 100) as q1
inner join (select item_id from item where item_type = 'Shield'
and level <= 100) as q2 on 1 = 1
inner join (select item_id from item where item_type = 'Weapon'
and level <= 100) as q3 on 1 = 1
...
where
q1.defense+q2.defense+q3.defense+... >= 100
order by
q1.damage+q2.damage+q3.damage+... desc
But, because there are approx 8k items in item, that means the magnitude of results for the DBMS to sort through is close to 8000^9 = 1.34218E+35 different setups! Is there a better way?

I think your problem can be solved using integer linear programming. I'd suggest pulling your data out of the database and giving it to one of the highly optimized solvers that have been written by people who have spent a long time working on their algorithms, rather than trying to write your own solver in SQL.
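To give a feel for why the problem is far smaller than 8000^9 once it leaves SQL, here is a small dynamic-programming sketch in Python. To be clear, this is a stand-in for a real ILP solver, not one: it assumes defense values are non-negative integers, it finds only the single best setup and not the top 10, and the item names and numbers are invented.

```python
def best_setup(slots, min_defense):
    """slots: one list per slot of (item_id, damage, defense) tuples.

    Returns (total_damage, [item_id per slot]) for the highest-damage setup
    whose total defense is at least min_defense, or None if none exists.
    """
    # State: defense so far, capped at min_defense -> (best damage, chosen ids).
    states = {0: (0, [])}
    for items in slots:
        next_states = {}
        for defense, (damage, chosen) in states.items():
            for item_id, item_damage, item_defense in items:
                d = min(min_defense, defense + item_defense)
                candidate = (damage + item_damage, chosen + [item_id])
                if d not in next_states or candidate[0] > next_states[d][0]:
                    next_states[d] = candidate
        states = next_states
    return states.get(min_defense)

slots = [
    [("sword", 50, 0), ("mace", 40, 30)],       # weapon slot
    [("plate", 0, 80), ("leather", 10, 40)],    # armor slot
    [("buckler", 5, 20), ("tower", 0, 60)],     # shield slot
]
print(best_setup(slots, 100))
```

Because defense is capped at the threshold, there are at most `min_defense + 1` states per slot, so the work grows with (number of items) x (threshold) and not with the product of the slot sizes.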

Can't you join with only the # most powerful items? Should reduce the collection size drastically. Logically the sum of the highest items should deliver the highest combinations.
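One caveat: because of the defense requirement, "most powerful" has to consider both attributes, not damage alone. A hedged way to do the pruning is to drop, within each slot, any item that another item in the same slot matches or beats on both damage and defense. The Python sketch below illustrates that with made-up data; it is safe when looking for the single best setup, while a top-10 list may need a few of the dropped items kept.

```python
def prune_dominated(items):
    """items: list of (item_id, damage, defense) for ONE slot.

    Keeps only items that no other item in the slot matches or beats on
    both damage and defense.
    """
    kept = []
    best_defense = float("-inf")
    # Visit highest damage first; ties are broken by highest defense.
    for item in sorted(items, key=lambda it: (-it[1], -it[2])):
        # Everything already seen has at least this much damage, so the item
        # survives only if it offers strictly more defense than all of them.
        if item[2] > best_defense:
            kept.append(item)
            best_defense = item[2]
    return kept

armor = [("a", 50, 0), ("b", 40, 30), ("c", 30, 20), ("d", 40, 10), ("e", 10, 80)]
print(prune_dominated(armor))
```

After pruning, each slot typically holds a handful of candidates, so even a cross join of the survivors becomes manageable.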

The first thing I would do is isolate your items. Instead of looking at the setup as a whole, look at the sum of the individual items. Unless your items interact with each other (set bonuses), you're going to go a long way by merely maximizing stat A and minimizing stat B for just one slot, and repeating that process for each item slot in your setup. This will drastically reduce the complexity of the query, even if it will mean more queries. It should make things faster in the long run.
Another thing to ask yourself is how much is it worth gaining stat B (the one you want to lose) to gain stat A? If you could gain 1000 A, and only have to gain 1 B, that might be worth it. But, what about gaining 10 A, but you'd have to gain 9 B to do it? Now things change a bit.
If you stick with an A:B ratio, you could probably do each slot separately, and join each of those separate results into one query.