Sphinx. How fast are Random results? - database

Does anybody have experience with getting random results from index with +100,000,000 (100 million) records.
The goal is getting 30 results ordered by random, at least 100 times per second.
Actually my records are in MySQL but selecting ORDER BY RAND() from huge tables is the most easiest way to kill MySQL.
Sphinxsearch or whatever what do you recommend?

I dont have that big an index to try.
barry#server:~/modules/sphinx-2.0.1-beta/api# time php test.php -i gi_stemmed --sortby #random --select id
Query '' retrieved 20 of 3067775 matches in 0.081 sec.
Query stats:
Matches:
<SNIP>
real 0m0.100s
user 0m0.010s
sys 0m0.010s
This is on a reasonably powerful dedicated server - that is serving live queries (~20qps)
But to be honest if you dont need filtering (ie each query has a 'WHERE' clause), you can just setup a system that returns random results - can do this with mysql. Just using ORDER BY RAND() is evil (and sphinx while better at sorting than mysql is still doing basically the same thing).
How 'sparse' is your data? If most of your ids are used, can just do soemthing like
$ids = array();
$max = getOne("SELECT MAX(id) FROM table");
foreach(range(1,30) as $idx) {
$ids[] = rand(1,$max);
}
$query = "SELECT * FROM table WHERE id IN (".implode(',',$ids).")";
(may want to use shuffle() in php on the results afterwards as you likly to get the results out of mysql in id order)
Which will be much more efficient. If you do have holes, perhaps just lookup 33 rows. Sometimes will get more than need, (just discard), but you should still get 30 most of the times.
(Of course you could cache the '$max' somewhere, so it doesnt have to be looked up all the time.)
Otherwise you could setup a dedicated 'shuffled' list. Basically a FIFO buffer, have one thread, filling it with random results (perhaps using the above system, using 3000 ids at a time) and then the consumers just read random results directly out of this queue.
FIFO, is not particully easy to implement with mysql, so maybe use a different system - maybe redis, or even just memcache.

Related

flask-sqlalchemy slow paginate count

I have a Postgres 10 database in my Flask app. I'm trying to paginate the filtering results on table over milions of rows. The problem is, that paginate method do counting total number of query results totaly ineffective.
Heres the example with dummy filter:
paginate = Buildings.query.filter(height>10).paginate(1,10)
Under the hood if perform 2 queries:
SELECT * FROM buildings where height > 10
SELECT count(*) FROM (
SELECT * FROM buildings where height > 10
)
--------
count returns 200,000 rows
The problem is that count on raw select without subquery is quite fast ~30ms, but paginate method wraps that into subquery that takes ~30s.
The query plan on cold database:
Is there an option of using default paginate method from flask-sqlalchemy in performant way?
EDIT:
To get the better understanding of my problem here is the real filter operations used in my case, but with dummy field names:
paginate = Buildings.query.filter_by(owner_id=None).filter(Buildings.address.like('%A%')).paginate(1,10)
So the SQL the ORM produce is:
SELECT count(*) AS count_1
FROM (SELECT foo_column, [...]
FROM buildings
WHERE buildings.owner_id IS NULL AND buildings.address LIKE '%A%' ) AS anon_1
That query is already optimized by indices from:
CREATE INDEX ix_trgm_buildings_address ON public.buildings USING gin (address gin_trgm_ops);
CREATE INDEX ix_buildings_owner_id ON public.buildings USING btree (owner_id)
The problem is just this count function, that's very slow.
So it looks like a disk-reading problem. The solutions would be get faster disks, get more RAM is it all can be cached, or if you have enough RAM than to use pg_prewarm to get all the data into the cache ahead of need. Or try increasing effective_io_concurrency, so that the bitmap heap scan can have more than one IO request outstanding at a time.
Your actual query seems to be more complex than the one you show, based on the Filter: entry and based on the Row Removed by Index Recheck: entry in combination with the lack of Lossy blocks. There might be some other things to try, but we would need to see the real query and the index definition (which apparently is not just an ordinary btree index on "height").

increase performance of a linq query using contains

I have a winforms app where I have a Telerik dropdownchecklist that lets the user select a group of state names.
Using EF and the database is stored in Azure SQL.
The code then hits a database of about 17,000 records and filters the results to only include states that are checked.
Works fine. I am wanting to update a count on the screen whenever they change the list box.
This is the code, in the itemCheckChanged event:
var states = stateDropDownList.CheckedItems.Select(i => i.Value.ToString()).ToList();
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop).ToArray();
ExportInfo_tb.Text = "Current Stop Count: " + filteredStops.Count();
It works, but it is slow.
I tried to load everything into a memory variable then querying that vs the database but can't seem to figure out how to do that.
Any suggestions?
Improvement:
I picked up a noticeable improvement by limiting the amount of data coming down by:
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop.Stop_state).ToList();
And better yet --
int count = (from stop in aDb.Stop_address_details where
states.Contains(stop.Stop_state)
select stop).Count();
ExportInfo_tb.Text = "Current Stop Count: " + count.ToString();
The performance of you query, actually, has nothing to do with Contiains, in this case. Contains is pretty performant. The problem, as you picked up on in your third solution, is that you are pulling far more data over the network than required.
In your first solution you are pulling back all of the rows from the server with the matching stop state and performing the count locally. This is the worst possible approach. You are pulling back data just to count it and you are pulling back far more data than you need.
In your second solution you limited the data coming back to a single field which is why the performance improved. This could have resulted in a significant improvement if your table is really wide. The problem with this is that you are still pulling back all the data just to count it locally.
In your third solution EF will translate the .Count() method into a query that performs the count for you. So the count will happen on the server and the only data returned is a single value; the result of count. Since network latency CAN often be (but is not always) the longest step when performing a query, returning less data can often result in significant gains in query speed.
The query translation of your final solution should look something like this:
SELECT COUNT(*) AS [value]
FROM [Stop_address_details] AS [t0]
WHERE [t0].[Stop_state] IN (#p0)

Neo4J / Cypher Query very slow with order by property

I have a graph database with 5M of nodes and 10M of relationships.
I'm on a Macbook Pro with 4GB RAM. I have already try to adjust java heap size and neo4j memory without success.
My problem is that i have a simply cypher query like that :
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
LIMIT 30;
This query takes 100ms , which is impressive. But when i add an "ORDER BY" this query takes a long time => 8s :/
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
ORDER BY r.date DESC
LIMIT 30;
Does Someone has an idea ?
You might want to consider relationship indexes to speed up your query. The date property could be indexed this way. You're using the ORDER BY keyword which will almost always make your query slower as it needs to iterate the entire result set to perform the ordering.
Also consider using a single MATCH statement if that suits your needs:
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)-[r:POSTED]->(n)

App Engine Datastore Viewer, how to show count of records using GQL?

I would think this would be easy for an SQL-alike! What I want is the GQL equivalent of:
select count(*) from foo;
and to get back an answer something similar to:
1972 records.
And I want to do this in GQL from the "command line" in the web-based DataStore viewer. (You know, the one that shows 20 at a time and lets me see "next 20")
Anyway -- I'm sure it's brain-dead easy, I just can't seem to find the correct syntax. Any help would be appreciated.
Thanks!
With straight Datastore Console, there is no direct way to do it, but I just figured out how to do it indirectly, with the OFFSET keyword.
So, given a table, we'll call foo, with a field called type that we want to check for values named "bar":
SELECT * FROM foo WHERE type="bar" OFFSET 1024
(We'll be doing a quick game of "warmer, colder" here, binary style)
Let's say that query returns nothing. Change OFFSET to 512, then 256, 128, 64, ... you get the idea. Same thing in reverse: Go up to 2048, 4096, 8192, 16384, etc. until you see no records, then back off.
I just did one here at work. Started with 2048, and noticed two records came up. There's 2049 in the table. In a more extreme case, (lets say there's 3300 records), you could start with 2048, notice there's a lot, go to 4096, there's none... Take the midpoint (1024 between 2048 and 4096 is 3072) next and notice you have records... From there you could add half the previous midpoint (512) to get 3584, and there's none. Whittle back down half (256) to get 3328, still none. Once more down half (128) to get 3200 and there's records. Go up half of the last val (64) and there's still records. Go up half again (32) to 3296 - still records, but so small you can easily see there's exactly 3300.
The nice thing about this vs. Datastore statistics to see how many records are in a table is you can limit it by the WHERE clause.
I don't think there is any direct way to get the count of entities via GQL. However you can get the count directly from the dashbaord
;
More details - https://cloud.google.com/appengine/docs/python/console/managing-datastore
As it's stated in other questions, it looks like there is no count aggregate function in GQL. The GQL Reference also doesn't say there is the ability to do this, though it doesn't explicitly say that it's not possible.
In the development console (running your application locally) it looks like just clicking the "List Entities" button will show you a list of all entities of a certain type, and you can see "Results 1-10 of (some number)" to get a total count in your development environment.
In production you can use the "Datastore Statistics" tab (the link right underneath the Datastore Viewer), choose "Display Statistics for: (your entity type)" and it will show you the total number of entities, however this is not the freshest view of the data (updated "at least once per day").
Since you can't run arbitrary code in production via the browser, I don't think saying "use .count() on a query" would help, but if you're using the Remote API, the .count() method is no longer capped at 1000 entries as of August, 2010, so you should be able to run print MyEntity.all().count() and get the result you want.
This is one of those surprising things that the datastore just can't do. I think the fastest way to do it would be to select __KEY__ from foo into a List, and then count the items in the list (which you can't do in the web-based viewer).
If you're happy with statistics that can be a little bit stale, you can go to the Datastore Statistics page of the admin console, which will tell you how many entities of each type there were some time ago. It seems like those stats are usually less than 10 hours old. Unfortunately, you can't query them more specifically.
There's no way to get a total count in GQL. Here's a way to get a count using python:
def count_models(model_class, max_fetch=1000):
total = 0
cursor = None
while True:
query = model_class.all(keys_only=True)
if cursor:
query.with_cursor(cursor)
results = query.fetch(max_fetch)
total += len(results)
print('still counting: ' + total)
if (len(results) < max_fetch):
return total
cursor = query.cursor()
You could run this function using the remote_api_shell, or add a custom page to your admin site to run this query. Obviously, if you've got millions of rows you're going to be waiting a while. You might be able to increase max_fetch, I'm not sure what the current fetch limit is.

Row count of a column family in Cassandra

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.
For instance, if I have a column family containing users and wanted to get the number of users. How could I do it? Each user is it's own row.
If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:
nodetool --host <hostname> cfstats
This will dump out a list for each column family looking like this:
Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Postives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634
The "Number of Keys (estimate)" row is a good guess across the cluster and the performance is a lot faster than explicit count approaches.
If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.
If you are not, you will need to store your user ids in a special row.
I found an excellent article on this here.. http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra
select count(*) from cf limit 1000000
Above statement can be used if we have an approximate upper bound known before hand. I found this useful for my case.
[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]
I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.
I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.
Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.
http://www.mail-archive.com/user#cassandra.apache.org/msg03965.html
There is always map/reduce but that probably goes without saying. If you have that with hive or pig, then you can do it for any table across the cluster though I am not sure tasktrackers know about cassandra locality and so it may have to stream the whole table across the network so you get task trackers on cassandra nodes but the data they receive may be from another cassandra node :(. I would love to hear if anyone knows for sure though.
NOTE: We are setting up map/reduce on cassandra mainly because if we want an index later, we can map/reduce one into cassandra.
I have been getting the counts like this after I convert the data into a hash in PHP.

Resources