If I had an entity with an indexed property, say "name," what would the performance of == queries on that property be like?
Of course, I understand that no exact answers are possible, but how does the performance correlate with the total number of entities for which name == x for some x, the total number of entities in the datastore, etc.?
How much slower would a query on name == x be if I had 1000 entities with name equalling x, versus 100 entities? Has any sort of benchmarking been done on this?
Some not very strenuous testing on my part indicated response times increased roughly linearly with the number of results returned. Note that even if you have 1000 entities, if you add a limit=100 to your query, it'll perform the same as if you only had 100 entities.
This is in line with the documentation which indicates that perf varies with the number of entities returned.
When I say not very strenuous, I mean that the response times were all over the place, and it was a very very rough estimate to draw a line through. I'd often see an order of magnitude difference in perf on the same request.
AppEngine does queries in a very optimized way, so it is virtually irrelevant from a performance standpoint whether you do a query on the name property vs. just doing a batch-get with the keys only. Either will be linear in the number of entities returned (so, 1000 entities returned will be pretty much exactly 10 times slower than 100 entities returned). The total number of entities stored in your database does not make a difference. What does make a tiny difference, though, is the number of different values for "name" that occur in your database.
The way this is done is via the indices (or indexes as preferred) stored along with your data. An index for the "name" property consists of a table that has all names sorted in alphabetical order (and a second one sorted in reverse alphabetical order, if you use descending order in any of your queries) and a query will then simply find the first occurrence of the name you are querying in the table and start returning results in order. This is called a "scan".
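To make the scan concrete, here is a minimal sketch in Python. It is not App Engine code: the `index` list and the `query_eq` function are hypothetical stand-ins for the sorted index table described above. It only illustrates why the cost tracks the number of results returned rather than the total number of entities.

```python
import bisect

# Hypothetical index: (name, entity_key) pairs kept sorted by name, then key.
index = sorted([
    ("alice", 3), ("bob", 1), ("bob", 7), ("bob", 9), ("carol", 2), ("dave", 5),
])

def query_eq(index, name, limit=None):
    """Return entity keys whose indexed name equals `name`, via an index scan."""
    # Binary search for the first entry with this name. The 1-tuple (name,)
    # sorts just before every (name, key) pair with the same name.
    start = bisect.bisect_left(index, (name,))
    results = []
    # Scan forward from that position; the work is linear in results returned.
    for i in range(start, len(index)):
        entry_name, key = index[i]
        if entry_name != name:
            break
        if limit is not None and len(results) >= limit:
            break
        results.append(key)
    return results

print(query_eq(index, "bob"))           # all matches
print(query_eq(index, "bob", limit=2))  # scan stops early once the limit is hit
```

Finding the start position is logarithmic in the index size, which is why the total entity count barely matters, while the forward scan stops as soon as the name changes or the limit is reached.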
This video is a bit technical, but it explains in detail how all this works and if you're concerned about coding for maximum performance, might be a good time investment:
Google I/O 2008: Under the Covers of the Google App Engine Datastore
(the video quality is fairly bad, but they also have the slides online (see link above video))
Why are results from search index queries limited to 200 rows, whereas standard view queries seem to have no limit?
Fundamentally because we hold a 200 item array in memory as we stream over all hits, preserving the top 200 scoring hits. A standard view just streams all rows between a start and end point. The intent of a search is typically to find the needle in a haystack, so you don't generally fetch thousands of results (compare with Google: who clicks through to page 500?). If you don't find what you want, you refine your search and look again.
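The bounded top-N idea can be sketched in a few lines of Python. This is an illustration of the general technique (a fixed-size min-heap over a stream), not the actual search implementation; `top_hits` and the `(score, doc_id)` pair layout are assumptions made for the example.

```python
import heapq

def top_hits(hits, k=200):
    """Keep the k highest-scoring (score, doc_id) pairs from a stream of hits."""
    if k <= 0:
        return []
    heap = []  # min-heap holding at most k entries; heap[0] is the weakest kept hit
    for hit in hits:
        if len(heap) < k:
            heapq.heappush(heap, hit)
        elif hit > heap[0]:
            # The new hit beats the weakest kept hit, so swap it in.
            heapq.heapreplace(heap, hit)
    # Best hit first.
    return sorted(heap, reverse=True)

# Stream 1000 hypothetical hits but never hold more than k of them in memory.
stream = ((score, "doc%d" % score) for score in range(1000))
print(top_hits(stream, k=3))
```

Memory stays proportional to k no matter how many hits are streamed, which is the trade-off behind a fixed result cap.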
There are cases when retrieving all matches makes sense (and we can stream this in the order we find them, so there's no RAM issue). That's a feature we can (and should) add, but it's not currently available.
It's also worth noting that the _view API (aka "mapreduce") is fundamentally different than search because of the ordering of results on disk. Materialized views are persisted in CouchDB b+ trees, so they are essentially sorted by key. That allows for efficient range queries (start/end key), and makes limit/paging trivial. However, it also means that you have to order the view rows on disk, which restricts the types of boolean queries that you can perform against the materialized views. That's where search helps, but Bob (aka "The Lucene Expert") notes the limitations.
I am unable to understand the role equi-depth histograms play in query optimization. Can someone please give me some pointers to good resources, or explain it? I have read a few research papers, but I still could not convince myself of the need for and use of equi-depth histograms. So, can someone please explain equi-depth histograms with an example?
Also can we merge the buckets of the histograms so that the histogram becomes small enough and fits in 1 page on disk?
Also what are bucket boundaries in equi-depth histograms?
Caveat: I'm not an expert on database internals, so this is a general, not a specific answer.
Query compilers convert the query, usually given in SQL, to a plan for obtaining the result. Plans consist of low level "instructions" to the database engine: scan table T looking for value V in column C; use index X on table T to locate value V; etc.
Query optimization is about the compiler deciding which of a (potentially huge) set of alternative query plans has minimum cost. Costs include wall clock time, IO bandwidth, intermediate result storage space, CPU time, etc. Conceptually, the optimizer is searching the alternative plan space, evaluating the cost of each to guide the search, ultimately choosing the cheapest it can find.
The costs mentioned above depend on estimates of how many records will be read and/or written, whether the records can be located by indexes, what columns of those records will be used, and the size of the data and/or how many disk pages they occupy.
These quantities in turn often depend on the exact data values stored in the tables. Consider for example select * from data where pay > 100 where pay is an indexed column. If the pay column has no values over 100, then the query is extremely cheap. A single probe of the index answers it. Conversely the result set could contain the entire table.
This is where histograms help. (Equi-depth histograms are just one way of maintaining histograms.) In the preceding query a histogram will in O(1) time provide an estimate of the fraction of rows that will be produced by the query without knowing exactly what those rows will contain.
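As a rough illustration of an equi-depth histogram and its bucket boundaries, here is a small Python sketch. The function names and the sample `pay` values are made up for the example, and real optimizers use more refined estimators (for instance, interpolating inside the bucket that straddles the threshold). It assumes a non-empty list of values.

```python
import bisect

def build_equi_depth(values, num_buckets):
    """Return the upper boundary of each bucket; every bucket holds about the
    same number of rows (that is what "equi-depth" means)."""
    ordered = sorted(values)
    n = len(ordered)
    # Upper boundary of bucket i is the value at the (i + 1) / num_buckets quantile.
    return [ordered[max(0, (i + 1) * n // num_buckets - 1)] for i in range(num_buckets)]

def estimate_fraction_greater(boundaries, threshold):
    """Coarse estimate of the fraction of rows with value > threshold."""
    # Buckets whose upper boundary is <= threshold hold no qualifying rows.
    # The bucket straddling the threshold is counted whole, so the estimate errs high.
    buckets_at_or_below = bisect.bisect_right(boundaries, threshold)
    return (len(boundaries) - buckets_at_or_below) / len(boundaries)

pay = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200]
boundaries = build_equi_depth(pay, 4)   # four buckets of three rows each
print(boundaries)
print(estimate_fraction_greater(boundaries, 100))
```

The lookup cost depends only on the number of buckets, not on the table size, which is why the optimizer can afford to consult the histogram for every candidate plan. Note that the boundaries are data values, so buckets are narrow where the data is dense and wide where it is sparse.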
In effect, the optimizer is "executing" the query on an abstraction of the data. The histogram is that abstraction. (Others are possible.) The histogram is useful for estimating costs and result sizes for query plan operations: join result size and page hits during mass insertions and deletions (which may lead to the generation of a temporary index), for example.
For a simple inner join example, suppose we know how integer-valued join columns of two tables are distributed:
Bins (each holds 25% of the table's rows)
Table A      Table B
0-100        151-300
101-150      301-500
151-175      601-700
176-300      1001-1100
It's easy to see that only 50% of Table A (the bins 151-175 and 176-300) and 25% of Table B (the bin 151-300) can possibly participate in the join. If these are unique-valued columns, each row matches at most one row on the other side, so a useful upper bound on the join size is min(.5 * |A|, .25 * |B|). This is a very simple example. In many (most?) cases, the analysis requires much more mathematical sophistication. For joins, it's usual to compute an estimated histogram of the results by "joining" the histograms of the operands. This is what makes the literature so diverse, complicated, and interesting.
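The participation fractions in this example can be checked mechanically. The sketch below is only an illustration: the `(low, high)` inclusive range representation and the function name are assumptions, and it does nothing beyond testing which bins overlap.

```python
# Each histogram is a list of inclusive (low, high) bins; every bin holds an
# equal share of its table's rows (25% each in this example).
hist_a = [(0, 100), (101, 150), (151, 175), (176, 300)]
hist_b = [(151, 300), (301, 500), (601, 700), (1001, 1100)]

def participating_fraction(hist, other):
    """Fraction of the bins in `hist` that overlap at least one bin of `other`."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]
    hits = sum(1 for bin_ in hist if any(overlaps(bin_, o) for o in other))
    return hits / len(hist)

frac_a = participating_fraction(hist_a, hist_b)
frac_b = participating_fraction(hist_b, hist_a)
print(frac_a, frac_b)

# With hypothetical table sizes, the candidate row counts on each side are:
size_a, size_b = 10_000, 40_000
print(frac_a * size_a, frac_b * size_b)
```

Rows in bins that overlap nothing on the other side cannot match, so the optimizer can discard them from the estimate before doing any finer analysis.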
PhD dissertations often have surveys that cover big bodies of technical literature like this in a concise form that isn't too difficult to read. (After all, the candidate is trying to convince a committee he/she knows how to do a literature search.) Here is one such example.
If I have a query that's structured like this:
q = Questions.all()
q.order('-votes')
results = q.run(limit=25)
And votes is just an IntegerProperty in a Questions db model, does the size/cost (basically what counts towards my quota) of the query depend on the number of entities?
Basically, if I'm trying to order 1000 Questions, is it more expensive than ordering only 10 Questions?
Short answer: no.
There are read costs and write costs.
Write costs occur when you write an entity, and the big influence is the number of indexed properties per entity.
Read costs are based on the number of entities returned in a query.
If you sort on votes, you need to make sure the votes property is indexed. That's 1-2 additional writes per entity written.
Read costs vary by the number of entities returned. The filter and sort order don't affect the cost on read.
I'm working on a web application where the user provides parameters, and these are used to produce a list of the top 1000 items from a database of up to 20 million rows. I need all top 1000 items at once, and I need this ranking to happen more or less instantaneously from the perspective of the user.
Currently, I'm using MySQL with a user-defined function to score and rank the data, then PHP takes it from there. Tested on a database of 1M rows, this takes about 8 seconds, but I need performance around 2 seconds, even for a database of up to 20M rows. Preferably, this number should be lower still, so that decent throughput is guaranteed for up to 50 simultaneous users.
I am open to any process with any software that can process this data as efficiently as possible, whether it is MySQL or not. Here are the features and constraints of the process:
The data for each row that is relevant to the scoring process is about 50 bytes per item.
Inserts and updates to the DB are negligible.
Each score is independent of the others, so scores can be computed in parallel.
Due to the large number of parameters and parameter values, the scores cannot be pre-computed.
The method should scale well for multiple simultaneous users
The fewer computing resources this requires, in terms of number of servers, the better.
Thanks
A feasible approach seems to be to load (and later update) all data into about 1GB RAM and perform the scoring and ranking outside MySQL in a language like C++. That should be faster than MySQL.
The scoring must be relatively simple for this approach, because your requirements (2 seconds for 20 million rows) leave only a tenth of a microsecond per row for scoring and ranking without parallelization or optimization.
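To show the shape of the in-memory approach, here is a small Python sketch. Python is used only for illustration (the answer suggests something like C++ for speed), and `top_items` and `weighted_sum` are hypothetical names; the real scoring function is not given in the question.

```python
import heapq

# Time budget per row: 2 seconds spread over 20 million rows, in nanoseconds.
budget_ns = 2.0 / 20_000_000 * 1e9
print(round(budget_ns), "ns per row")

def top_items(rows, params, score, k=1000):
    """Score every in-memory row and return the k best (score, row_id) pairs."""
    scored = ((score(row, params), row_id) for row_id, row in enumerate(rows))
    # nlargest keeps only k candidates at a time instead of sorting everything.
    return heapq.nlargest(k, scored)

def weighted_sum(row, weights):
    """Hypothetical scoring function: a weighted sum of the row's fields."""
    return sum(w * x for w, x in zip(weights, row))

rows = [(1.0, 2.0), (3.0, 0.5), (0.0, 4.0)]   # stand-in for the data loaded into RAM
best = top_items(rows, (1.0, 2.0), weighted_sum, k=2)
print(best)
```

Because each score is independent, the rows could be split into chunks, each chunk reduced to its own top k on a separate core, and the partial results merged with one more `nlargest` call.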
It would help if you could post the query you are having issues with.
In the meantime, here are some things to check.
Make sure you have indexes created on the database.
Make sure to use optimized queries, and use joins instead of inner queries.
Based on your criteria, the possibility of improving performance would depend on whether or not you can use the input criteria to pre-filter the number of rows for which you need to calculate scores. I.e. if one of the user-provided parameters automatically disqualifies a large fraction of the rows, then applying that filtering first would improve performance. If none of the parameters have that characteristic, then you may need either much more hardware or a database with higher performance.
I'd say for this sort of problem, if you've done all the obvious software optimizations (and we can't know that, since you haven't mentioned anything about your software approaches), you should try for some serious hardware optimization. Max out the memory on your SQL servers, and try to fit your tables into memory where possible. Use an SSD for your table / index storage, for speedy deserialization. If you're clustered, crank up the networking to the highest feasible network speeds.
I've got a rather tricky one, bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows in memory at 16G -- it was the only way to query the data fast enough). The data itself is kinda-sorta static. There's a whole lot of it, but any new data is a somewhat slow trickle at this point.
The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". Then the database returns that many scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score -- i.e. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the values -- which by the way are from -10000 to 10000)
As we transition to cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this, however like I said the scores tend to clump at the extremes (which would put too much of a burden on one node). So my first question is, how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask, show me 500 scores that are the closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that cassandra supports secondary indices (yay) but only hash type (boo - no ranges). Do we create a separate ColumnFamily for this use case?
And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.
We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also all of the data WOULD be evenly distributed using a random partitioner. However since MOST of our queries are interested only in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
At this point we're running low on ideas. Everything I've seen about cassandra makes me wonder if it's even appropriate for this task. We chose it mainly because of its horizontal scalability, which is important (much easier to add a node than to shard an RDBMS). So I suppose my overall question is: how would you approach this? If cassandra, please address any of the above issues. Otherwise any insight or wisdom would be appreciated. Thanks.
Why not store the classifier as the column family row key and the score as the column name? Since columns are sorted, it is really fast to query the top/bottom 500 columns for a given classifier. The second type of query is also possible: when you are looking for the scores near s, you can for instance select 500 columns before s and 500 columns after s, and then filter down to the 500 columns nearest s.
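The access pattern can be sketched with plain Python lists standing in for sorted columns. This is not Cassandra client code: the `rows` dictionary, the sample scores, and the function names are all hypothetical, and the sketch only shows why sorted columns make all three queries cheap slices.

```python
import bisect

# Hypothetical stand-in for the column family: row key = classifier ID,
# columns = that classifier's scores kept in sorted order.
rows = {6: sorted([-9000, -200, 350, 390, 400, 410, 450, 9100])}

def top_n(scores, n):
    """Highest n scores, best first: a slice from the end of the sorted columns."""
    start = max(0, len(scores) - n)
    return scores[start:][::-1]

def bottom_n(scores, n):
    """Lowest n scores: a slice from the start of the sorted columns."""
    return scores[:n]

def nearest_n(scores, target, n):
    """The n scores closest to target."""
    pos = bisect.bisect_left(scores, target)
    # Take up to n columns before the target position and n from it onward,
    # then keep the n closest; the true n nearest always lie in this window.
    window = scores[max(0, pos - n):pos + n]
    return sorted(window, key=lambda s: abs(s - target))[:n]

scores = rows[6]
print(top_n(scores, 2))
print(bottom_n(scores, 2))
print(nearest_n(scores, 400, 3))
```

Since each classifier is its own row, a random partitioner can still spread classifiers evenly across nodes while the ordering needed for range slices lives inside each row.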