I need a query to prevent a join that produces 1.34218E+35 results!
I have a table item (approx 8k items; e.g. Shield of Foo, Weapon of Bar), and each item is one of 9 different item_type (Armor, Weapon, etc). Each item has multiple entries in item_attribute (e.g. Damage, Defense). Here is a pseudo-code representation:
Table item (
  item_id autoincrement,
  ...
  item_type_id char, -- e.g. Armor, Weapon, etc.
  level int          -- must be at least this level to wear this item
);
Table item_attribute (
  item_id int references item(item_id),
  ...
  attribute char,    -- e.g. Damage, Defense, etc.
  amount int         -- e.g. 100
);
Now, a character wears 9 total items at once (one each of Armor, Weapon, Shield, etc) that I call a setup. I want to build a list of setups that maximizes an attribute, but has a minimum of another attribute. In example terms: for a character level 100, present the top 10 setups by damage where sum(defense of all items) >= 100.
The naïve approach is:
select top 10
  q1.item_id, q2.item_id, q3.item_id, ...,
  q1.damage + q2.damage + q3.damage + ... as damage
from
  (select item_id, damage, defense from item where item_type = 'Armor'
   and level <= 100) as q1
  inner join (select item_id, damage, defense from item where item_type = 'Shield'
   and level <= 100) as q2 on 1 = 1
  inner join (select item_id, damage, defense from item where item_type = 'Weapon'
   and level <= 100) as q3 on 1 = 1
  ...
where
  q1.defense + q2.defense + q3.defense + ... >= 100
order by
  q1.damage + q2.damage + q3.damage + ... desc
But, because there are approx 8k items in item, that means the magnitude of results for the DBMS to sort through is close to 8000^9 = 1.34218E+35 different setups! Is there a better way?
I think your problem can be solved using integer linear programming. I'd suggest pulling your data out of the database and giving it to one of the highly optimized solvers that have been written by people who have spent a long time working on their algorithms, rather than trying to write your own solver in SQL.
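For illustration, here is a minimal sketch of that formulation in Python using the PuLP library (my choice of solver wrapper, not something from the question; load_items_for_level is a hypothetical function that pulls the eligible rows out of the database):

from pulp import LpProblem, LpMaximize, LpVariable, lpSum

# Hypothetical: all items with level <= 100, as dicts like
# {"id": 17, "type": "Armor", "damage": 12, "defense": 40}
items = load_items_for_level(100)

# One binary decision variable per item: 1 means the item is in the setup
x = {i["id"]: LpVariable("use_%d" % i["id"], cat="Binary") for i in items}

prob = LpProblem("best_setup", LpMaximize)
prob += lpSum(i["damage"] * x[i["id"]] for i in items)           # maximize damage
prob += lpSum(i["defense"] * x[i["id"]] for i in items) >= 100   # minimum total defense
for t in set(i["type"] for i in items):                          # exactly one item per slot
    prob += lpSum(x[i["id"]] for i in items if i["type"] == t) == 1
prob.solve()

setup = [i["id"] for i in items if x[i["id"]].value() == 1]

This yields the single best setup; to get the top 10, you can re-solve repeatedly, each time adding a constraint that excludes the setups already found.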
Can't you join with only the N most powerful items per type? That should reduce the collection size drastically. Logically, the sum of the strongest individual items should yield the strongest combinations.
The first thing I would do is isolate your items. Instead of looking at the setup as a whole, look at the sum of the individual items. Unless your items interact with each other (set bonuses), you'll go a long way by simply maximizing stat A and minimizing stat B for one slot at a time, repeating that process for each item slot in your setup. This drastically reduces the complexity of the query, even if it means more queries. It should make things faster in the long run.
Another thing to ask yourself is: how much of stat B (the one you want to minimize) is it worth taking on to gain stat A? Gaining 1000 A at the cost of only 1 B might be worth it. But what about gaining 10 A at the cost of 9 B? Now things change a bit.
If you stick with an A:B ratio, you could probably score each slot separately and join each of those separate results into one query; a rough sketch of the pruning idea follows below.
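Combining this with the "top N per slot" idea above, a rough Python sketch, assuming slots has already been built with one cheap query per item type. (Caveat: pruning by damage alone can throw away the defense you need, so in practice you'd keep the top k by damage plus the top k by defense per slot.)

from itertools import product

TOP_K = 5  # keep only the strongest few candidates per slot

# slots: {"Armor": [(damage, defense, item_id), ...], "Weapon": [...], ...}
pruned = {slot: sorted(cands, reverse=True)[:TOP_K]
          for slot, cands in slots.items()}

best = []
for combo in product(*pruned.values()):   # 5^9 ~ 2M combos instead of 8000^9
    damage = sum(c[0] for c in combo)
    defense = sum(c[1] for c in combo)
    if defense >= 100:
        best.append((damage, combo))
best.sort(reverse=True)
top10 = best[:10]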
Related
I've finished my first semester in a college-level SQL course where we used "SQL Queries for Mere Mortals", 3rd edition.
Long term I want to work in data governance or as a data scientist, so digging deeper is needed, and I found the Stanford SQL course. Taking the first mini quiz today, I got the answers right, but on these two I don't understand WHY the answers are right.
My 'SQL for Mere Mortals' book doesn't even cover hash or tree-based indexes so I've been searching online for them.
I mostly guessed based on what she said, but it feels more like luck than "I solidly understand why". So I ordered "Introduction to Algorithms", 3rd edition, by Thomas Cormen; it arrived last week, but it will take me a while to read through all 1,229 pages.
I found that book via this other Stack Overflow link: https://stackoverflow.com/questions/66515417/why-is-hash-function-fast
Stanford Course => https://www.edx.org/course/databases-5-sql
I thought a hash index on College.enrollment would not speed up the query because the predicate is a range ("less than a number") rather than an exact value. I'm also guessing, per this link (Better to use "less than equal" or "in" in sql query), that the query would be faster if we used "<=" rather than "<"?
This one was just a process of elimination, since the answer mentions the first item after the WHERE clause, but it then got confusing when it also mentions the last part, Apply.cName = College.cName.
My questions:
Just as algebra has technical terms like numerator, denominator, and quotient that precisely describe parts of an equation, how would you use technical terms to describe why these answers are correct?
On the second question, why are the first part of the second line and the last part of the same line referenced as the answers? Why didn't they pick the first part of each line, or the last part of each?
For context, most of my SQL queries are now written for PostgreSQL within PyCharm in Python, but I do a lot of practice using the pgAdmin 4 and MySQL Workbench desktop tools.
I welcome any recommendations for paper books or PDFs with step-by-step tutorials, as many, many websites have holes or reference technical details in confusing ways.
Thanks
1. A hash index is only useful for equality matches, whereas a tree index can be used for inequality (< or >= etc).
With this in mind, College.enrollment < 5000 cannot use a hash index, as it is an inequality. All other options are exact equality matches.
This is why tree-based indexes are the default, and often the only option, in most RDBMSs.
2. This one is pretty much up in the air.
"the first item after the WHERE clause" is not relevant. Most RDBMSs will reorder the joins and filters as they see fit in order to match indexes and table statistics.
I note that the query as given is poorly written. It should use proper JOIN syntax, which is much clearer and has been in use for 30 years already.
SELECT * -- you should really specify exact columns
FROM Student AS s -- use aliases
JOIN [Apply] AS a ON a.sID = s.sID -- Apply is a reserved keyword in many RDBMSs
JOIN College AS c ON c.cName = a.cName
WHERE s.GPA > 1.5 AND c.cName < 'Cornell';
Now it's hard to say what a compiler would do here. A lot depends on the cardinalities (size of tables) in absolute terms and relative to each other, as well as the data skew in s.GPA and c.cName.
It also depends on whether secondary key (or indeed INCLUDE) columns are added, this is clearly not being considered.
Given the options for indexes you have above, and no other indexes (not realistic obviously), we could guesstimate:
Student.sID, College.cName
This may result in an efficient backwards scan on College starting from 'Cornell', but Apply would need to be joined with a hash or a naive nested loop (scanning the index each time).
The index on Student would mean an efficient nested loop with an index seek.
Student.sID, Student.GPA
Is this one index or two? If it's two separate indexes, the second will be used, and the first is obviously going to be useless. Apply and College will still need heavy joins.
Apply.cName, College.cName
This would probably get you a merge-join on those two columns, but Student would need a big join.
Apply.sID, Student.GPA
Student could be efficiently scanned from 1.5, and Apply could be seeked, but College requires a big join.
Of these options, the first or the last is probably better, but it's very hard to say without further info.
In a real system, I would have indexes on all tables, and use INCLUDE columns wisely in order to avoid key-lookups. You would want to try to get a better feel for which tables are the ones that need to be filtered early etc.
First question
A hash-index is not linearly-searchable (see Slide 7), that is, you cannot perform range-comparisons with a hash-index. This is because (in general terms) hash functions are one-way: given the output of a hash function you cannot determine the input, and the output will be in apparently random order (having a random order is good for ensuring an even load over the set of hashtable bins).
Now, for a contrived and oversimplified example:
Supposing you have these rows:
PK | Enrollment
----------------
1 | 1
2 | 10
3 | 100
4 | 1000
5 | 10000
A perfect hash index of this table would look something like this:
Assuming that the hash of 1 is 0xF822AA896F34253E and the hash of 10 is 0xB383A8BBDAA41F98, and so on...
EnrollmentHash | PhysicalRowPointer
---------------------------------------
0xF822AA896F34253E | 1
0xB383A8BBDAA41F98 | 2
0xA60DCD4E78869C9C | 3
0x49B0AF769E6B1EB3 | 4
0x724FD1728666B90B | 5
So given this hashtable index, looking at the hashes you cannot determine which hash represents larger enrollment values vs. smaller values. But a hashtable index does give you O(1) lookup for single specific values, which is why it works best for discrete, non-continuous, data values, especially columns used in JOIN criteria.
A tree index, by contrast, does preserve relative ordering information about values, at the cost of O(log n) lookup time.
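Here is a minimal Python sketch of the difference, using a dict as a stand-in for a hash index and a sorted list with bisect as a stand-in for a tree index:

import bisect

rows = [(1, 1), (2, 10), (3, 100), (4, 1000), (5, 10000)]  # (PK, enrollment)

# Hash-style index: O(1) average lookup, but only for exact values
hash_index = {enrollment: pk for pk, enrollment in rows}
print(hash_index[1000])    # equality match works: prints 4
# There is no way to answer "enrollment < 5000" here without scanning every key.

# Tree-style index: keys kept sorted, O(log n) lookup, and a range is contiguous
sorted_keys = sorted(e for _, e in rows)
cut = bisect.bisect_left(sorted_keys, 5000)
print(sorted_keys[:cut])   # all enrollments < 5000: prints [1, 10, 100, 1000]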
Second question
First, I need to rewrite the query to use modern JOIN syntax. The old style (using commas) has been obsolete since SQL-92 in 1992; that's almost 30 years ago.
SELECT
    *
FROM
    Apply
    INNER JOIN Student ON Student.sID = Apply.sID
    INNER JOIN College ON College.cName = Apply.cName
WHERE
    Student.GPA > 1.5
    AND
    College.cName < 'Cornell'
Now, generally speaking the best way to answer this kind of question would be to know what the STATISTICS (cardinality, value distribution, etc) of the tables are. But without that I can still make some guesses.
I assume that College is the smallest table (~500 rows?), Student will have maybe 1-2m rows, and, assuming every Student makes 4-5 applications, the Apply table will have ~5m rows.
...armed with that inference, we can deduce:
Student.sID = Apply.sID is an ID match - so a hash-index would be better in most cases (excepting if the PK clustering matters, but I won't digress).
Student.GPA > 1.5 - this is a range search so having a tree-based index here helps.
College.cName < 'Cornell' - again, this is a range comparison so a tree-based index here helps too.
So the best indexes would be Student.GPA and College.cName, but that isn't an option - so let's see what the benefits of each option are...
(As I was writing this, I saw that #charlieface posted their answer which already covers this, so I'll just link to theirs to save my time: https://stackoverflow.com/a/67829326/159145 )
In my program you can book an item. Each item has an id of 6 characters drawn from 32 possible characters, so there are 32^6 possible ids. Every id must be unique.
func tryToAddItem() {
    let id = generateId()
    if !db.contains(id) {
        addItem(id)      // store the item under the id we just checked
    } else {
        tryToAddItem()   // collision: generate a new id and try again
    }
}
For example, say 90% of my ids are used. Then the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
That is quite high, and it means 5 database queries against a lot of data.
When the probability gets that high, I want to introduce a prefix: "A-xxxxxx".
What is a good condition for that? At what point will I need a prefix?
In my example, 90% of the ids were used. What about the rest? Do I throw them away?
What about database performance when I call tryToAddItem 5 times? I imagine that is not best practice.
For example, say 90% of my ids are used. Then the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
Not quite. Let's represent the number of calls you make with the random variable X, and let's call the probability of an id collision p. You want the probability that you make the call at most five times, or in general at most k times:
P(X≤k) = P(X=1) + P(X=2) + ... + P(X=k)
= (1-p) + (1-p)*p + (1-p)*p^2 +... + (1-p)*p^(k-1)
= (1-p)*(1 + p + p^2 + .. + p^(k-1))
If we expand this out all but two terms cancel and we get:
= 1- p^k
Which we want to be greater than some probability, x:
1 - p^k > x
Or with p in terms of k and x:
p < (1-x)^(1/k)
where you can adjust x and k for your specific needs.
If you want less than a 50% probability of 5 or more calls, then no more than (1-0.5)^(1/5) ≈ 87% of your ids should be taken.
First of all make sure there is an index on the id columns you are looking up. Then I would recommend thinking more in terms of setting a very low probability of a very bad event occurring. For example maybe making 20 calls slows down the database for too long, so we'd like to set the probability of this occurring to <0.1%. Using the formula above we find that no more than 70% of ids should be taken.
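As a quick sanity check of those numbers in Python:

def max_load_factor(x, k):
    # From 1 - p**k > x, solve for p:  p < (1 - x) ** (1 / k)
    return (1 - x) ** (1.0 / k)

print(max_load_factor(0.5, 5))     # ~0.87, the 50%/5-call example above
print(max_load_factor(0.999, 20))  # ~0.71, the 0.1%/20-call example above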
But you should also consider alternative solutions. Is remapping all ids to a larger space one time only a possibility?
Or if adding ids with prefixes is not a big deal then you could generate longer ids with prefixes for all new items going forward and not have to worry about collisions.
Thanks for the response. I searched for alternatives and want to show three possibilities.
First possibility: create an UpcomingItemIdTable holding 200 (more or less) valid itemIds. A background task can replenish it every minute (or as often as you need), so tryToAddItem will always get a valid itemId instantly.
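A minimal sketch of that pool idea (all the db.* helpers here are hypothetical placeholders):

import random
import string

ALPHABET = string.ascii_uppercase + "234567"  # assuming these are the 32 characters

def refill_pool(db, target=200):
    # Background task: keep ~200 known-free ids ready for instant use.
    # The contains-check and the insert must happen atomically.
    while db.pool_size() < target:
        candidate = "".join(random.choice(ALPHABET) for _ in range(6))
        if not db.contains(candidate):
            db.pool_add(candidate)

def try_to_add_item(db):
    item_id = db.pool_take()  # O(1), no collision retries at booking time
    db.add_item(item_id)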
Second possibility
Is remapping all ids to a larger space one time only a possibility?
In my case, yes. For other problems, I think the answer will be: it depends.
Third possibility: try to generate an itemId, and when there is a collision, try again.
Possible collision handling: do some tests beforehand. Measure the time to generate itemIds when there are already 1,000; 10,000; 100,000; 1,000,000; etc. entries in the table. When the tryToAddItem method needs more than 100 ms (or whatever you prefer), increase the id length from 6 to 7, 8, or 9 characters, as in the sketch below.
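A sketch of that measurement (try_to_add_item is the method from the question, renamed for Python; the db handle is a placeholder):

import time

def avg_insert_ms(db, samples=100):
    # Populate the table to the target size first, then time the retrying insert
    start = time.time()
    for _ in range(samples):
        try_to_add_item(db)
    return (time.time() - start) / samples * 1000.0

# Run this at 1,000; 10,000; 100,000; 1,000,000 rows; when it creeps above
# ~100 ms, grow the id length from 6 to 7 (or more) characters.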
Some thoughts
every request must be atomic
create an index on itemId
Disadvantages for long UUIDs in API: See https://zalando.github.io/restful-api-guidelines/#144
less usable, because...
-cannot be memorized and easily communicated by humans
-harder to use in debugging and logging analysis
-less convenient for consumer facing usage
-quite long: readable representation requires 36 characters and comes with higher memory and bandwidth consumption
-not ordered along their creation history and no indication of used id volume
-may be in conflict with additional backward compatibility support of legacy ids
[...]
TL;DR: for my case, every possibility works. As so often, it depends on the problem. Thanks for the input.
I have a large table (~1M rows now, soon ~10M) that has two ranked columns (in addition to the regular data):
avg_visited, a float 0-1 representing a %age popularity; higher is better
alexa_rank, an integer 1-N giving an a priori ranking
The a priori ranking is from external sources so can't be changed. Many rows have no popularity yet (as no user has yet hit it), so the a priori ranking is the fallback ordering. The popularity however does change very frequently - both to update old entries and to add a popularity to ones that previously only had the a priori ranking, if some user actually hits it.
I frequently run SELECT id, url, alexa_rank, avg_visited FROM sites ORDER BY avg_visited DESC, alexa_rank ASC LIMIT 49500, 500 (for various values of 49500).
However, ORDER BY cannot use an index when the sort directions are mixed, per http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
This is MySQL 5.1, InnoDB.
How can I best change this situation to give me a sane, fully indexed query?
Unfortunately, MySQL does not support DESC clauses in indexes, nor does it support indexes on derived expressions.
You can store the negative popularity along with the positive one and use it in the ORDER BY:
CREATE INDEX ix_mytable_negpopularity_apriori ON mytable (neg_popularity, a_priori);
INSERT
INTO mytable (popularity, neg_popularity)
VALUES (#popularity, -#popularity);
SELECT *
FROM mytable
ORDER BY
neg_popularity, a_priori
Just a simple hack: since popularity is a float between 0 and 1, you can multiply it by -1 to get a number between -1 and 0.
This way you reverse the sort order of popularity and can use ORDER BY popularity ASC, a_priori ASC.
Not sure the overhead outweighs the gain.
This reminds me of the hack of storing emails in reverse form.
I'm trying to write a GQL query that returns N random records of a specific kind. My current implementation works but requires N calls to the datastore. I'd like to make it 1 call to the datastore if possible.
I currently assign a random number to every entity of this kind that I put into the datastore. When I query for a random record, I generate another random number and query for records with number > rand, ORDER BY that number ASC, LIMIT 1.
This works, however, it only returns 1 record so I need to do N queries. Any ideas on how to make this one query? Thanks.
"Under the hood" a single search query call can only return a set of consecutive rows from some index. This is why some GQL queries, including any use of !=, expand to multiple datastore calls.
N independent uniform random selections are not (in general) consecutive in any index.
QED.
You could probably use memcache to store the entities, and reduce the cost of grabbing N of them. Or if you don't mind the "random" selections being close together in the index, select a randomly-chosen block of (say) 100 in one query, then pick N at random from those. Since you have a field that's already randomised, it won't be immediately obvious to an outsider that the N items are related. At least, not until they look at a lot of samples and notice that items A and Z never appear in the same group, because they're more than 100 apart in the randomised index. And if performance permits, you can re-randomise your entities from time to time.
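A rough sketch of that block approach against the old db API (assuming a model MyModel whose entities store their random number in a property called rand):

import random

# One query: grab a block of 100 consecutive entities starting at a random
# point in the randomised index, then pick N of them locally.
n = 5  # how many random entities you want
r = random.random()
block = MyModel.all().filter("rand >", r).order("rand").fetch(100)
picks = random.sample(block, min(n, len(block)))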
What kind of tradeoffs are you looking for? If you are willing to put up with a small performance hit on inserting these entities, you can create a solution to get N of them very quickly.
Here's what you need to do:
When you insert your Entities, specify the key. You want to give keys to your entities in order, starting with 1 and going up from there. (This will require some effort, as App Engine doesn't have autoincrement(), so you'll need to keep track of the last id you used in some other entity; let's call it an IdGenerator.)
Now when you need N random entities, generate N random numbers between 1 and whatever the last id you generated was (your IdGenerator will know this). You can then do a batch get by key using the N keys, which will only require one trip to the datastore, and will be faster than a query as well, since key gets are generally faster than queries, AFAIK.
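A sketch of that batch get (the IdGenerator accessor and the kind name are placeholders):

import random
from google.appengine.ext import db

n = 5  # how many random entities you want
max_id = id_generator.last_id()  # hypothetical IdGenerator entity
ids = random.sample(xrange(1, max_id + 1), n)
keys = [db.Key.from_path('MyEntity', i) for i in ids]
entities = db.get(keys)  # one datastore round trip for all N keys
entities = [e for e in entities if e is not None]  # drop ids with no entity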
This method does require dealing with a few annoying details:
Your IdGenerator might become a bottleneck if you are inserting lots of these items on the fly (more than a few a second), which would require some kind of sharded IdGenerator implementation. If all this data is preloaded, or is not high volume, you have it easy.
You might find that some Id doesn't actually have an entity associated with it anymore, because you deleted it or because a put() failed somewhere. If this happened you'd have to grab another random entity. (If you wanted to get fancy and reduce the odds of this you could make this Id available to the IdGenerator to reuse to "fill in the holes")
So the question comes down to how fast you need these N items vs how often you will be adding and deleting them, and whether a little extra complexity is worth that performance boost.
It looks like the only method is to store a random integer value in each entity as a special property and query on that. This can be done quite automatically if you just add an automatically initialized property.
Unfortunately this will require processing all existing entities once, if your datastore is already populated.
It's weird, I know.
I agree with Steve's answer: there is no way to retrieve N random rows in one query.
However, even the method of retrieving one single entity does not usually produce evenly distributed results. The probability of returning a given entity is proportional to the gap between its randomly assigned number and the next lower assigned number. E.g. if the random numbers 1, 2, and 10 have been assigned (and none of the numbers 3-9), the algorithm will return "10" eight times more often than "2".
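You can see the skew with a quick simulation (using the numbers from that example):

import random
from collections import Counter

assigned = [1, 2, 10]  # the randomly assigned numbers currently in use
hits = Counter()
for _ in range(100000):
    r = random.uniform(0, 10)
    higher = [a for a in assigned if a > r]  # the "> rand ORDER BY asc LIMIT 1" query
    if higher:
        hits[min(higher)] += 1
print(hits)  # 10 wins ~8x as often as 1 or 2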
I have fixed this in a slightly more expensive way. If someone is interested, I am happy to share it.
I just had the same problem. I decided not to assign IDs to my already existing entries in the datastore, and did this instead, since I already had the total count from a sharded counter.
This selects "count" entries from "totalcount" entries, sorted by key.
import logging
import random
from google.appengine.ext import db

# select $count random offsets from the complete set
# (totalcount comes from the sharded counter; count is how many to select)
numberlist = random.sample(range(0, totalcount), count)
numberlist.sort()
pagesize = 1000

# init buckets: group the offsets into pages of `pagesize`
buckets = [[] for i in xrange(int(max(numberlist) / pagesize) + 1)]
for k in numberlist:
    thisb = int(k / pagesize)
    buckets[thisb].append(k - (thisb * pagesize))
logging.debug("Numbers: %s. Buckets %s", numberlist, buckets)

# page through the results, one keys-only query per page
result = []
baseq = db.Query(MyEntries, keys_only=True).order("__key__")
wq = baseq  # start at the beginning of the index
for b, l in enumerate(buckets):
    if len(l) > 0:
        result += [wq.fetch(limit=1, offset=e)[0] for e in l]
    if b < len(buckets) - 1:  # not the last bucket
        lastkey = wq.fetch(1, pagesize - 1)[0]
        wq = db.Query(MyEntries, keys_only=True).order("__key__").filter("__key__ >", lastkey)
Beware that this is somewhat complex, even to me, and I'm still not convinced that I don't have off-by-one or off-by-x errors.
Also beware that if count is close to totalcount, this can be very expensive.
And beware that on millions of rows it might not be possible to finish within App Engine's time limits.
If I understand correctly, you need to retrieve N random instances.
That's easy. Just run a keys-only query, call random.choice on the resulting list of keys N times, then fetch the results by key.
import random
from google.appengine.ext import db

keys = MyModel.all(keys_only=True)
n = 5  # 5 random instances

all_keys = list(keys)
result_keys = []
for _ in range(0, n):
    key = random.choice(all_keys)
    all_keys.remove(key)  # don't pick the same key twice
    result_keys.append(key)

# result_keys now contains 5 random keys; fetch them in one batch get
items = db.get(result_keys)
Is there any way to select a subset from a large set based on a property or predicate in less than O(n) time?
For a simple example, say I have a large set of authors. Each author has a one-to-many relationship with a set of books, and a one-to-one relationship with a city of birth.
Is there a way to efficiently do a query like "get all books by authors who were born in Chicago"? The only way I can think of is to first select all authors from the city (fast with a good index), then iterate through them and accumulate all their books (O(n) where n is the number of authors from Chicago).
I know databases do something like this in certain joins, and Endeca claims to be able to do this "fast" using what they call "Record Relationship Navigation", but I haven't been able to find anything about the actual algorithms used or even their computational complexity.
I'm not particularly concerned with the exact data structure... I'd be jazzed to learn about how to do this in a RDBMS, or a key/value repository, or just about anything.
Also, what about third or fourth degree requests of this nature? (Get me all the books written by authors living in cities with immigrant populations greater than 10,000...) Is there a generalized n-degree algorithm, and what is its performance characteristics?
Edit:
I am probably just really dense, but I don't see how the inverted index suggestion helps. For example, say I had the following data:
DATA
1. Milton England
2. Shakespeare England
3. Twain USA
4. Milton Paradise Lost
5. Shakespeare Hamlet
6. Shakespeare Othello
7. Twain Tom Sawyer
8. Twain Huck Finn
INDEX
"Milton" (1, 4)
"Shakespeare" (2, 5, 6)
"Twain" (3, 7, 8)
"Paradise Lost" (4)
"Hamlet" (5)
"Othello" (6)
"Tom Sawyer" (7)
"Huck Finn" (8)
"England" (1, 2)
"USA" (3)
Say I did my query on "books by authors from England". Very quickly, in O(1) time via a hashtable, I could get my list of authors from England: (1, 2). But then, to retrieve the books, I'd have to do ANOTHER O(1) lookup for EACH element of the set {1, 2}: 1 -> {4}, 2 -> {5, 6}, and then union the results: {4, 5, 6}.
Or am I missing something? Perhaps you meant I should explicitly store an index entry linking Book to Country. That works for very small data sets. But for a large data set, the number of indexes required to match any possible combination of queries would make the index grow exponentially.
For joins like this on large data sets, a modern RDBMS will often use an algorithm called a list merge. Using your example:
Prepare a list, A, of all authors who live in Chicago and sort them by author in O(N log N) time.*
Prepare a list, B, of all (author, book name) pairs and sort them by author in O(M log M) time.*
Place these two lists "side by side", and compare the authors from the "top" (lexicographically minimum) element in each pile.
Are they the same? If so:
Output the (author, book name) pair from top(B)
Remove the top element of the B pile
Goto 3.
Otherwise, is top(A).author < top(B).author? If so:
Remove the top element of the A pile
Goto 3.
Otherwise, it must be that top(A).author > top(B).author:
Remove the top element of the B pile
Goto 3.
* (Or O(0) time if the table is already sorted by author, or has an index which is.)
The loop continues removing one item at a time until both piles are empty, thus taking O(N + M) steps, where N and M are the sizes of piles A and B respectively. Because the two "piles" are sorted by author, this algorithm will discover every matching pair. It does not require an index (although the presence of indexes may remove the need for one or both sort operations at the start).
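A minimal Python rendering of that loop (steps 3 onward), with toy data standing in for the two sorted inputs:

def merge_join(a, b):
    # a: author names sorted ascending; b: (author, book) pairs sorted by author
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j][0]:
            out.append(b[j])  # emit the pair; keep a[i], the author may have more books
            j += 1
        elif a[i] < b[j][0]:
            i += 1
        else:
            j += 1
    return out

authors = ["Bellow", "Hemingway"]  # list A, already sorted
books = [("Bellow", "Herzog"), ("Hemingway", "The Sun Also Rises"),
         ("Roth", "American Pastoral")]  # list B, already sorted
print(merge_join(authors, books))  # finds every matching pair in O(N + M)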
Note that the RDBMS may well choose a different algorithm (e.g. the simple one you mentioned) if it estimates that it would be faster to do so. The RDBMS's query analyser generally estimates the costs in terms of disk accesses and CPU time for many thousands of different approaches, possibly taking into account such information as the statistical distributions of values in the relevant tables, and selects the best.
SELECT a.*, b.*
FROM Authors AS a, Books AS b
WHERE a.author_id = b.author_id
AND a.birth_city = "Chicago"
AND a.birth_state = "IL";
A good optimizer will process that in less time than it would take to read the whole list of authors and the whole list of books; that is therefore sub-linear time. (If you have another definition of what you mean by sub-linear, speak out.)
Note that the optimizer should be able to choose the order in which to process the tables that is most advantageous. And this applies to N-level sets of queries.
Generally speaking, RDBMSes handle these types of queries very well. Both commercial and open source database engines have evolved over decades using all the reasonable computing algorithms applicable, to do just this task as fast as possible.
I would venture a guess that the only way you would beat an RDBMS on speed is if your data is specifically organized and requires specific algorithms. Some RDBMSes let you specify which of the underlying algorithms to use for manipulating data, and with open-source ones you can always rewrite or implement a new algorithm if needed.
However, unless your case is very special, I believe that would be serious overkill. For most cases, putting the data in an RDBMS and manipulating it via SQL should work well enough that you don't have to worry about the underlying algorithms.
Inverted Index.
Since this has a loop, I'm sure it fails the sub-O(n) test. However, when your result set has n rows, it's impossible to avoid iterating over the result set. The query itself, though, is just two hash lookups.
from collections import defaultdict

country = ["England", "USA"]
author = [("Milton", "England"), ("Shakespeare", "England"), ("Twain", "USA")]
title = [("Milton", "Paradise Lost"),
         ("Shakespeare", "Hamlet"),
         ("Shakespeare", "Othello"),
         ("Twain", "Tom Sawyer"),
         ("Twain", "Huck Finn"),
        ]

# Inverted indexes: map each value to the ids of the rows that contain it.
inv_country = {}
for id, c in enumerate(country):
    inv_country.setdefault(c, defaultdict(list))
    inv_country[c]['country'].append(id)

inv_author = {}
for id, row in enumerate(author):
    a, c = row
    inv_author.setdefault(a, defaultdict(list))
    inv_author[a]['author'].append(id)
    inv_country[c]['author'].append(id)   # country -> author rows

inv_title = {}
for id, row in enumerate(title):
    a, t = row
    inv_title.setdefault(t, defaultdict(list))
    inv_title[t]['title'].append(id)
    inv_author[a]['title'].append(id)     # author -> title rows

# Books by authors from England: country -> authors -> titles
for a in inv_country['England']['author']:
    for t in inv_author[author[a][0]]['title']:
        print(title[t])