flask-sqlalchemy slow paginate count - database

I have a Postgres 10 database in my Flask app. I'm trying to paginate filtered results on a table with millions of rows. The problem is that the paginate method counts the total number of query results in a very inefficient way.
Here's an example with a dummy filter:
paginate = Buildings.query.filter(Buildings.height > 10).paginate(1, 10)
Under the hood it performs 2 queries:
SELECT * FROM buildings where height > 10
SELECT count(*) FROM (
SELECT * FROM buildings where height > 10
)
--------
count returns 200,000 rows
The problem is that a count on the raw SELECT without a subquery is quite fast (~30 ms), but the paginate method wraps it in a subquery that takes ~30 s.
The query plan on cold database:
Is there a way to use the default paginate method from flask-sqlalchemy in a performant way?
EDIT:
To give a better understanding of my problem, here are the real filter operations used in my case, with dummy field names:
paginate = Buildings.query.filter_by(owner_id=None).filter(Buildings.address.like('%A%')).paginate(1,10)
So the SQL the ORM produces is:
SELECT count(*) AS count_1
FROM (SELECT foo_column, [...]
FROM buildings
WHERE buildings.owner_id IS NULL AND buildings.address LIKE '%A%' ) AS anon_1
That query is already optimized by these indexes:
CREATE INDEX ix_trgm_buildings_address ON public.buildings USING gin (address gin_trgm_ops);
CREATE INDEX ix_buildings_owner_id ON public.buildings USING btree (owner_id)
The problem is just this count function, which is very slow.

So it looks like a disk-reading problem. The solutions would be to get faster disks, to get more RAM so that it can all be cached, or, if you already have enough RAM, to use pg_prewarm to load the data into the cache ahead of need. Or try increasing effective_io_concurrency, so that the bitmap heap scan can have more than one IO request outstanding at a time.
Your actual query seems to be more complex than the one you show, based on the Filter: entry and on the Rows Removed by Index Recheck: entry in combination with the lack of Lossy blocks. There might be some other things to try, but we would need to see the real query and the index definition (which apparently is not just an ordinary btree index on "height").

Related

SQL using more simple indexes

My questions are: which indexes are used, in what order, and why, in the following sample?
Query:
SELECT House
FROM myTable
WHERE 1=1
and City='myCity'
and Street='myStreet'
and Color='myColor'
Indexes:
Ind1: City
Ind2: Street
Ind3: Color
Ind4: Street,Color
It depends. The server may have statistics, so it will choose the index with the most selective filter. For example:
if City='myCity' returns 100
if Street='myStreet' returns 1000
if Color='myColor' returns 10000
rows, then the City index will be used. The same logic applies to composite indexes.
The optimizer tries to get the smallest set first; the other filters are then applied to that set.
This requires up-to-date statistics, otherwise the wrong index might be used.
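One way to see which index an optimizer actually picks is to ask it for the query plan. The following is only a sketch under stated assumptions: the question does not name a database engine, so Python's built-in sqlite3 module is used as a stand-in, the table layout is guessed from the query, and the helper name explain is hypothetical. Other engines may choose differently, and the choice can change once statistics have been collected.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the unnamed engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable (House TEXT, City TEXT, Street TEXT, Color TEXT);
    CREATE INDEX Ind1 ON myTable (City);
    CREATE INDEX Ind2 ON myTable (Street);
    CREATE INDEX Ind3 ON myTable (Color);
    CREATE INDEX Ind4 ON myTable (Street, Color);
""")


def explain(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of every plan row.
    return [row[-1] for row in rows]


plan = explain(
    "SELECT House FROM myTable WHERE City = ? AND Street = ? AND Color = ?",
    ("myCity", "myStreet", "myColor"),
)
for line in plan:
    print(line)
```

Running ANALYZE after loading representative data and printing the plan again is a simple way to check whether the statistics change the chosen index.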

Neo4J / Cypher Query very slow with order by property

I have a graph database with 5M of nodes and 10M of relationships.
I'm on a MacBook Pro with 4GB RAM. I have already tried adjusting the Java heap size and Neo4j memory settings, without success.
My problem is that I have a simple Cypher query like this:
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
LIMIT 30;
This query takes 100ms, which is impressive. But when I add an "ORDER BY", the query takes a long time (8s):
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
ORDER BY r.date DESC
LIMIT 30;
Does anyone have an idea?
You might want to consider relationship indexes to speed up your query. The date property could be indexed this way. You're using the ORDER BY keyword which will almost always make your query slower as it needs to iterate the entire result set to perform the ordering.
Also consider using a single MATCH statement if that suits your needs:
MATCH (pet:Pet {id:52163})-[:FOLLOWS]->(friend)-[r:POSTED]->(n)
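The cost difference between LIMIT alone and ORDER BY plus LIMIT can be illustrated outside the database. The sketch below is plain Python, not Neo4j: the generator rows and its counter are hypothetical stand-ins for a result stream. Taking the first 30 rows stops after 30 items, while picking the 30 rows with the latest date has to look at every row.

```python
import heapq
from itertools import islice


def rows(counter, n):
    """Yield n fake result rows and count how many were actually produced."""
    for i in range(n):
        counter[0] += 1
        # 7919 is coprime with 10000, so the dates are a shuffled 0..n-1.
        yield {"id": i, "date": (i * 7919) % n}


# LIMIT 30 without ordering: the stream is abandoned after 30 rows.
seen_limit = [0]
first_30 = list(islice(rows(seen_limit, 10000), 30))

# ORDER BY date DESC LIMIT 30: every row must be examined to find the top 30.
seen_sorted = [0]
top_30 = heapq.nlargest(30, rows(seen_sorted, 10000), key=lambda r: r["date"])

print("rows read with LIMIT only:    ", seen_limit[0])
print("rows read with ORDER BY+LIMIT:", seen_sorted[0])
```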

lua and lsqlite3: speeding up select statement

I'm using the lsqlite3 lua wrapper and I'm making queries into a database. My DB has ~5million rows and the code I'm using to retrieve rows is akin to:
db = lsqlite3.open('mydb')
local temp = {}
local sql = "SELECT A,B FROM tab where FOO=BAR ORDER BY A DESC LIMIT N"
for row in db:nrows(sql) do temp[row['key']] = row['col1'] end
As you can see I'm trying to get the top N rows sorted in descending order by FOO (I want to get the top rows and then apply the LIMIT not the other way around). I indexed the column A but it doesn't seem to make much of a difference. How can I make this faster?
You need to index the column on which you filter (i.e. with the WHERE clause). The reason is that ORDER BY comes into play after filtering, not the other way around.
So you probably should create an index on FOO.
Can you post your table schema?
UPDATE
Also you can increase the sqlite cache, e.g.:
PRAGMA cache_size=100000
You can adjust this depending on the memory available and the size of your database.
UPDATE 2
If you want a better understanding of how your query is handled by sqlite, you can ask it to provide you with the query plan:
http://www.sqlite.org/eqp.html
UPDATE 3
I did not understand your context properly in my initial answer. If you ORDER BY on a large data set, you probably want to use the index on the ordering column, not the previous one, so you can tell sqlite not to use the index on the filtered column this way:
SELECT a, b FROM foo WHERE +a > 30 ORDER BY b
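The effect of the unary + can be checked with EXPLAIN QUERY PLAN. The sketch below uses Python's built-in sqlite3 module instead of lsqlite3, and the index names idx_a and idx_b are assumptions made for the illustration. It prints the plan for the query with and without the + prefix; with the prefix, the WHERE term can no longer be driven by the index on a.

```python
import sqlite3

# Assumption: a tiny in-memory table mirroring the example query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (a INTEGER, b INTEGER);
    CREATE INDEX idx_a ON foo (a);
    CREATE INDEX idx_b ON foo (b);
""")


def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail text as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plain_plan = plan("SELECT a, b FROM foo WHERE a > 30 ORDER BY b")
plus_plan = plan("SELECT a, b FROM foo WHERE +a > 30 ORDER BY b")

print("without +:", plain_plan)
print("with +:   ", plus_plan)
```

Which plan is faster depends on the data, so it is worth timing both variants against the real table.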

Google app engine and paging

How would one go about writing a query that selects items 2000-2010 out of a collection of 10000 objects in the data store?
I know that it can be done like this in GQL:
select * from MyObject limit 10 offset 2000
According to the documentation, when using an offset the engine will still fetch all the rows, only not return them, thus making the query perform in a way that corresponds linearly with the value of offset.
Is there any better way? Such as using a pseudo ROWNUM column like one could do in other types of data stores.
There's no way to efficiently page using offsets, except to cache the results. You can, however, use datastore cursors to implement paging using a 'bookmark' type approach.
Besides using cursors, you can also use a sort order approach. For example, for the first 10 objects:
SELECT * FROM MyObject ORDER BY field LIMIT 10;
and then for the next 10 objects, and so on:
SELECT * FROM MyObject WHERE field > largestFieldValueFromPreviousResult ORDER BY field LIMIT 10;
Field could even be a key if you don't have another appropriate field. Here is a more complete example:
http://code.google.com/appengine/articles/paging.html
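The sort order approach can be sketched in a few lines. This is an illustration only: it runs against an in-memory SQLite table through Python's sqlite3 module rather than the App Engine datastore, and the table my_object and the helper fetch_page are hypothetical names. The point is that each page is found by a range condition on the sort field, so no rows are skipped with an offset.

```python
import sqlite3

# Assumption: SQLite stands in for the datastore; 100 rows with field 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_object (field INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO my_object (field, payload) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 101)],
)


def fetch_page(after=None, page_size=10):
    """Fetch the next page; `after` is the largest field value already seen."""
    if after is None:
        sql = "SELECT field, payload FROM my_object ORDER BY field LIMIT ?"
        params = (page_size,)
    else:
        sql = (
            "SELECT field, payload FROM my_object "
            "WHERE field > ? ORDER BY field LIMIT ?"
        )
        params = (after, page_size)
    return conn.execute(sql, params).fetchall()


first_page = fetch_page()
# The bookmark is the sort value of the last row on the previous page.
second_page = fetch_page(after=first_page[-1][0])
print([row[0] for row in second_page])
```

This only works cleanly when the sort field is unique; with duplicates, a tie-breaker such as the key would have to be added to the ordering.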

Adding a projection to an NHibernate criteria stops it from performing default entity selection

I'm writing an NHibernate criteria that selects data supporting paging. I'm using the COUNT(*) OVER() expression from SQL Server 2005(+) to get hold of the total number of available rows, as suggested by Ayende Rahien. I need that number to be able to calculate how many pages there are in total. The beauty of this solution is that I don't need to execute a second query to get hold of the row count.
However, I can't seem to manage to write a working criteria (Ayende only provides an HQL query).
Here's an SQL query that shows what I want and it works just fine. Note that I intentionally left out the actual paging logic to focus on the problem:
SELECT Items.*, COUNT(*) OVER() AS rowcount
FROM Items
Here's the HQL:
select
item, rowcount()
from
Item item
Note that the rowcount() function is registered in a custom NHibernate dialect and resolves to COUNT(*) OVER() in SQL.
A requirement is that the query is expressed using a criteria. Unfortunately, I don't know how to get it right:
var query = Session
.CreateCriteria<Item>("item")
.SetProjection(
Projections.SqlFunction("rowcount", NHibernateUtil.Int32));
Whenever I add a projection, NHibernate doesn't select item (like it would without a projection), just the rowcount(), while I really need both. Also, I can't seem to project item as a whole, only its properties, and I really don't want to list all of them.
I hope someone has a solution to this. Thanks anyway.
I don't think it is possible with Criteria; it has some limits.
You could get the id and load items in a subsequent query:
var query = Session
.CreateCriteria<Item>("item")
.SetProjection(Projections.ProjectionList()
.Add(Projections.SqlFunction("rowcount", NHibernateUtil.Int32))
.Add(Projections.Id()));
If you don't like it, use HQL; you can set the maximum number of results there too:
// Each row is an object[] holding the item and the row count
IList<object[]> result = Session
.CreateQuery("select item, rowcount() from Item item where ..." )
.SetMaxResults(100)
.List<object[]>();
Use CreateMultiCriteria.
You can execute 2 simple statements with only one hit to the DB that way.
I am wondering why using Criteria is a requirement. Can't you use session.CreateSQLQuery? If you really must do it in one query, I would have suggested pulling back the Item objects and the count, like:
select {item.*}, count(*) over()
from Item {item}
...this way you can get back Item objects from your query, along with the count. If you experience a problem with Hibernate's caching, you can also configure the query spaces (entity/table caches) associated with a native query so that stale query cache entries will be cleared automatically.
If I understand your question properly, I have a solution. I struggled quite a bit with this same problem.
Let me quickly describe the problem I had, to make sure we're on the same page. My problem came down to paging. I want to display 10 records in the UI, but I also want to know the total number of records that matched the filter criteria. I wanted to accomplish this using the NH criteria API, but when adding a projection for row count, my query no longer worked, and I wouldn't get any results (I don't remember the specific error, but it sounds like what you're getting).
Here's my solution (copy & paste from my current production code). Note that "SessionError" is the name of the business entity I'm retrieving paged data for, according to 3 filter criteria: IsDev, IsRead, and IsResolved.
ICriteria crit = CurrentSession.CreateCriteria(typeof (SessionError))
.Add(Restrictions.Eq("WebApp", this));
if (isDev.HasValue)
crit.Add(Restrictions.Eq("IsDev", isDev.Value));
if (isRead.HasValue)
crit.Add(Restrictions.Eq("IsRead", isRead.Value));
if (isResolved.HasValue)
crit.Add(Restrictions.Eq("IsResolved", isResolved.Value));
// Order by most recent
crit.AddOrder(Order.Desc("DateCreated"));
// Copy the ICriteria query to get a row count as well
ICriteria critCount = CriteriaTransformer.Clone(crit)
.SetProjection(Projections.RowCountInt64());
critCount.Orders.Clear();
// NOW add the paging vars to the original query
crit = crit
.SetMaxResults(pageSize)
.SetFirstResult((pageNum_oneBased - 1) * pageSize);
// Set up a multi criteria to get your data in a single trip to the database
IMultiCriteria multCrit = CurrentSession.CreateMultiCriteria()
.Add(crit)
.Add(critCount);
// Get the results
IList results = multCrit.List();
List<SessionError> sessionErrors = new List<SessionError>();
foreach (SessionError sessErr in ((IList)results[0]))
sessionErrors.Add(sessErr);
numResults = (long)((IList)results[1])[0];
So I create my base criteria, with optional restrictions. Then I CLONE it, and add a row count projection to the CLONED criteria. Note that I clone it before I add the paging restrictions. Then I set up an IMultiCriteria to contain the original and cloned ICriteria objects, and use the IMultiCriteria to execute both of them. Now I have my paged data from the original ICriteria (and I only dragged the data I need across the wire), and also a raw count of how many actual records matched my criteria (useful for display or creating paging links, or whatever). This strategy has worked well for me. I hope this is helpful.
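The same idea, one filter shared by a count query and a paged query, can be sketched independently of NHibernate. The example below is a hedged illustration in Python with the built-in sqlite3 module: the table session_error, its columns, and the helper fetch_page are assumptions, and unlike IMultiCriteria it issues two separate statements rather than one batched round trip. The offset is computed as (page - 1) * page_size, assuming one-based page numbers.

```python
import sqlite3

# Assumption: a small in-memory table standing in for the SessionError entity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session_error "
    "(id INTEGER PRIMARY KEY, is_read INTEGER, date_created INTEGER)"
)
conn.executemany(
    "INSERT INTO session_error (id, is_read, date_created) VALUES (?, ?, ?)",
    [(i, i % 2, i) for i in range(1, 51)],
)


def fetch_page(is_read, page_num_one_based, page_size):
    """Return (ids on the requested page, total number of matching rows)."""
    # The filter is built once and reused by both statements.
    where, params = "WHERE is_read = ?", (is_read,)

    # Count query: same filter, no ordering, no paging.
    total = conn.execute(
        f"SELECT COUNT(*) FROM session_error {where}", params
    ).fetchone()[0]

    # Page query: same filter, most recent first, then LIMIT/OFFSET.
    offset = (page_num_one_based - 1) * page_size
    rows = conn.execute(
        f"SELECT id FROM session_error {where} "
        "ORDER BY date_created DESC LIMIT ? OFFSET ?",
        params + (page_size, offset),
    ).fetchall()
    return [row[0] for row in rows], total


page_ids, total = fetch_page(is_read=0, page_num_one_based=1, page_size=10)
print(page_ids, total)
```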
I would suggest investigating a custom result transformer by calling SetResultTransformer() on your criteria.
Create a formula property in the class mapping:
<property name="TotalRecords" formula="count(*) over()" type="Int32" not-null="true"/>
IList<...> result = criteria.SetFirstResult(skip).SetMaxResults(take).List<...>();
totalRecords = (result != null && result.Count > 0) ? result[0].TotalRecords : 0;
return result;

Resources