Django query filter using large array of ids in Postgres DB

I want to run a query from Django against my PostgreSQL database. When I filter the query using a large array of ids, it is very slow and takes up to 70s.
After looking for an answer I found this post, which gives a solution to my problem: simply replace the ARRAY[ids] in the IN statement with VALUES (id1), (id2), ....
I tested the solution with a raw query in pgAdmin, and the query went from 70s to 300ms.
How can I do the same thing (i.e. not using an array of ids but a query with VALUES) in Django?

I found a solution building on @erwin-brandstetter's answer, using a custom lookup:
from django.db.models import Lookup
from django.db.models.fields import Field
@Field.register_lookup
class EfficientInLookup(Lookup):
    lookup_name = "ineff"
    def as_sql(self, compiler, connection):
        # Compile both sides, then wrap the right-hand array in unnest()
        # so Postgres sees a set instead of a long IN list.
        lhs, lhs_params = self.process_lhs(compiler, connection)
        rhs, rhs_params = self.process_rhs(compiler, connection)
        params = lhs_params + rhs_params
        return "%s IN (SELECT unnest(%s))" % (lhs, rhs), params
This allows filtering like this:
MyModel.objects.filter(id__ineff=<list-of-values>)

The trick is to transform the array to a set somehow.
Instead of (this form is only good for a short array):
SELECT *
FROM tbl t
WHERE t.tbl_id = ANY($1);
-- WHERE t.tbl_id IN($1); -- equivalent
$1 being the array parameter.
You can still pass an array like you had it, but unnest and join. Like:
SELECT *
FROM tbl t
JOIN unnest($1) arr(id) ON arr.id = t.tbl_id;
Or you can keep your query, too, but replace the array with a subquery unnesting it:
SELECT * FROM tbl t
WHERE t.tbl_id = ANY (SELECT unnest($1));
Or:
SELECT * FROM tbl t
WHERE t.tbl_id IN (SELECT unnest($1));
Same effect for performance as passing a set with a VALUES expression. But passing the array is typically much simpler.
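From Python, the join variant above can be sent with the whole list as a single parameter. This is a minimal sketch, assuming a driver such as psycopg2 that adapts a Python list to a Postgres array. The helper name build_unnest_query and the bigint[] cast are assumptions made for illustration, not part of the answer above.

```python
def build_unnest_query(table, column, ids, array_type="bigint[]"):
    """Return (sql, params) that filter `table` by a large list of ids.

    The whole list travels as ONE array parameter; unnest() turns it
    into a set that Postgres can join against.
    `table`, `column` and `array_type` are interpolated directly, so they
    must come from trusted code, never from user input.
    """
    sql = (
        f"SELECT t.* FROM {table} t "
        f"JOIN unnest(%s::{array_type}) AS arr(id) ON arr.id = t.{column}"
    )
    return sql, [list(ids)]


sql, params = build_unnest_query("tbl", "tbl_id", range(1, 6))
print(sql)
print(params)
# With psycopg2 this would be run as: cursor.execute(sql, params)
```

Unlike IN, the join form returns a row once per matching array element, so duplicate ids in the list should be removed first if that matters.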
Detailed explanation:
IN vs ANY operator in PostgreSQL
How to use ANY instead of IN in a WHERE clause with Rails?
Optimizing a Postgres query with a large IN

Is this an example of the first thing you're asking?
relation_list = list(ModelA.objects.filter(id__gt=100))
obj_query = ModelB.objects.filter(a_relation__in=relation_list)
That would be an "IN" command because you're first evaluating relation_list by casting it to a list, and then using it in your second query.
If instead you do the exact same thing without the list() call, so that relation_list stays a lazy QuerySet, Django will only make one query (the first QuerySet becomes a subquery) and the database can optimize it. So it should be more efficient that way.
You can always see the SQL command you'll be executing with print(obj_query.query) if you're curious what's happening under the hood.
Hope that answers the question, sorry if it doesn't.

I had a lot of trouble making the custom lookup 'ineff' work.
I may have solved it, but would love some validation from Django and Postgres experts.
1) Using it 'directly' on a ForeignKey field (ModelB)
ModelA.objects.filter(ModelB__ineff=queryset_ModelB)
Throws the following exception:
"Related Field got invalid lookup: ineff"
ForeignKey fields cannot be used with custom lookups.
A similar issue is reported here:
Custom lookup is not being registered in Django
2) Using it 'indirectly' on the pk field of related model (ModelB.id)
ModelA.objects.filter(ModelB__id__ineff=queryset_ModelB.values_list('id', flat=True))
Throws the following exception:
"can only concatenate list (not "tuple") to list"
Looking at Django Traceback, I noticed that rhs_params is a tuple.
Yet we try to add it to lhs_params (a list) in our custom lookup.
Hence I changed:
params = lhs_params + rhs_params
into:
params = lhs_params + list(rhs_params)
3) I then got a Postgres error (so at least I had got past the Django ORM)
"function unnest(uuid) does not exist"
"HINT: No function matches the given name and argument types. You might need to add explicit type casts."
I apparently solved it by changing the sql:
from:
return "%s IN (SELECT unnest(%s))" % (lhs, rhs), params
to:
return "%s IN (SELECT unnest(ARRAY(%s)))" % (lhs, rhs), params
Hence my final as_sql method looks like this:
def as_sql(self, compiler, connection):
    lhs, lhs_params = self.process_lhs(compiler, connection)
    rhs, rhs_params = self.process_rhs(compiler, connection)
    # rhs_params can be a tuple, so convert it before concatenating
    params = lhs_params + list(rhs_params)
    return "%s IN (SELECT unnest(ARRAY(%s)))" % (lhs, rhs), params
It seems to work, and is indeed faster than __in (tested with EXPLAIN ANALYZE in Postgres).
But I would love to have some validation from experts, perhaps Erwin Brandstetter?
Thanks for your input.
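The tuple-versus-list problem from step 2 can be reproduced without Django. The sketch below uses a hypothetical helper, build_in_unnest, that mirrors what the final as_sql does with already-compiled fragments. The table and column names are made up, and this is an illustration rather than Django's API.

```python
def build_in_unnest(lhs, lhs_params, rhs, rhs_params):
    """Mimic the final as_sql: combine SQL fragments and their params."""
    # Normalise both sides: either may be a list or a tuple,
    # and list + tuple raises TypeError.
    params = list(lhs_params) + list(rhs_params)
    sql = "%s IN (SELECT unnest(ARRAY(%s)))" % (lhs, rhs)
    return sql, params


# Reproduce the original failure (list + tuple):
try:
    [] + ("a",)
except TypeError as exc:
    print("without list():", exc)

# Hypothetical fragments, shaped like what the compiler hands to as_sql.
sql, params = build_in_unnest(
    '"app_modelb"."id"', [],
    'SELECT U0."id" FROM "app_modelb" U0 WHERE U0."flag" = %s', (True,),
)
print(sql)
print(params)
```

Converting both sides with list() keeps the helper working whether the compiler returns lists or tuples.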

Related

django query get rows where two columns difference is greater than a value

I have a model with several fields, and I want to run a select query like this in Django:
SELECT * FROM TABLE WHERE column1-column2 > 10000
I tried filter(), but after a little search I found out that I should use .annotate() and I changed my query to:
Account.objects.annotate(realcharge=(F('charge')-F('amount')), realcharge__lt=10000)
But I get this error:
'int' object has no attribute 'resolve_expression'
How should I write my query?
My Django version is 1.11.
I don't know if you still need an answer, but you could try calling .filter() after .annotate(), like this:
from django.db.models import F
# Note: I split this query in 2 lines for better readability
qs = Account.objects.annotate(realcharge=F('charge') - F('amount'))
# __gt matches the "> 10000" in the SQL above; the attempt in the question used __lt
qs = qs.filter(realcharge__gt=10000)

Lucene Query syntax using Boolean Clauses

I have two fields in Lucene
type (can contain values like X, Y, Z)
date (contains values like 2015-18-10 etc)
I want to write the following query: (type = X and date = today's date) OR (type = anything except X).
How can I write this query using SHOULD, MUST, MUST_NOT? It looks like there is no clause for this type of query.
You can express the latter part using *:* -type:X, as this creates the set of all documents, and then subtracts the set of documents that has type:X. The *:* query is represented as MatchAllDocsQuery in code.
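The set logic behind *:* -type:X can be checked with plain Python sets. This is only a simulation of the semantics on made-up documents, not Lucene code; the sample data and variable names are assumptions.

```python
# Made-up documents standing in for an index.
docs = [
    {"id": 1, "type": "X", "date": "2015-10-18"},
    {"id": 2, "type": "X", "date": "2015-10-17"},
    {"id": 3, "type": "Y", "date": "2015-10-17"},
    {"id": 4, "type": "Z", "date": "2015-10-18"},
]
today = "2015-10-18"

all_docs = {d["id"] for d in docs}                     # *:*
type_x = {d["id"] for d in docs if d["type"] == "X"}   # type:X
dated_today = {d["id"] for d in docs if d["date"] == today}

x_today = type_x & dated_today   # both clauses required (MUST)
not_x = all_docs - type_x        # *:* -type:X
result = x_today | not_x         # either sub-query may match (SHOULD)

print(sorted(result))  # [1, 3, 4]
```

Document 2 is the only one excluded: it has type X but is not dated today.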
If I understood your problem, the solution is just a combination of BooleanQuery instances. The following code, written in Scala, addresses the issue.
According to the documentation (in BooleanClause.java), MUST_NOT should be used with caution:
Use this operator for clauses that must not appear in the matching documents. Note that it is not possible to search for queries that only consist of a MUST_NOT clause.
import org.apache.lucene.index.Term
import org.apache.lucene.search.{BooleanClause, BooleanQuery, MatchAllDocsQuery, TermQuery}
object LuceneTest extends App {
  // Assumes an older Lucene where BooleanQuery is mutable (newer releases use BooleanQuery.Builder)
  val query = new BooleanQuery
  val subQuery1 = new BooleanQuery
  subQuery1.add(new TermQuery(new Term("type", "xx")), BooleanClause.Occur.MUST)
  subQuery1.add(new TermQuery(new Term("date", "yy")), BooleanClause.Occur.MUST)
  val subQuery2 = new BooleanQuery
  // As mentioned above, MatchAllDocsQuery is added here so the sub-query does not consist only of MUST_NOT
  subQuery2.add(new MatchAllDocsQuery, BooleanClause.Occur.MUST)
  subQuery2.add(new TermQuery(new Term("type", "xx")), BooleanClause.Occur.MUST_NOT)
  // subQuery1 and subQuery2 are the two sub-queries;
  // OR (i.e. SHOULD in Lucene) combines them
  query.add(subQuery1, BooleanClause.Occur.SHOULD)
  query.add(subQuery2, BooleanClause.Occur.SHOULD)
  query
}
Anyway, hope it helps.

Slick: Difficulties working with Column[Int] values

I have a followup to another Slick question I recently asked (Slick table Query: Trouble with recognizing values) here. Please bear with me!! I'm new to databasing and Slick seems especially poor on documentation. Anyway, I have this table:
object Users extends Table[(Int, String)]("Users") {
  def userId = column[Int]("UserId", O.PrimaryKey, O.AutoInc)
  def userName = column[String]("UserName")
  def * = userId ~ userName
}
Part I
I'm attempting to query with this function:
def findByQuery(where: List[(String, String)]) = SlickInit.dbSlave withSession {
  val q = for {
    x <- Users if foo((x.userId, x.userName), where)
  } yield x
  q.firstOption.map { case (userId, userName) =>
    User(userId, userName)
  }
}
where "where" is a list of search queries //ex. ("userId", "1"),("userName", "Alex")
"foo" is a helper function that tests equality. I'm running into a type error.
x.userId is of type Column[Int]. How can one manipulate this as an Int? I tried casting, ex:
foo(x.userId.asInstanceOf[Int]...)
but am also experiencing trouble with that. How does one deal with Slick return types?
Part II
Is anyone familiar with the casting function:
def * = userId ~ userName <> (User, User.unapply _)
? I know there have been some excellent answers to this question, most notably here: scala slick method I can not understand so far and a very similar question here: mapped projection with companion object in SLICK. But can anyone explain why the compiler responds with
<> method overloaded
for that simple line of code?
Let's start with the problem:
val q = for {
  x <- Users if foo((x.userId, x.userName), where)
} yield x
See, Slick transforms Scala expressions into SQL. To be able to transform conditions like yours into an SQL statement, Slick requires some special types to be used. The way these types work is actually part of the transformation Slick performs.
For example, when you write List(1,2,3) filter { x => x == 2 } the filter predicate is executed for each element in the list. But Slick can't do that! So Query[ATable] filter { arow => arow.id === 2 } actually means "make a select with the condition id = 2" (I am skipping details here).
I wrote a mock of your foo function and asked Slick to generate the SQL for the query q:
select x2."UserId", x2."UserName" from "Users" x2 where false
See the false? That's because foo is a simple predicate that Scala evaluates into the Boolean false. A similar predicate done in a Query, instead of a list, evaluates into a description of what needs to be done in the SQL generation. Compare the difference between the filters in List and in Slick:
List[A].filter(A => Boolean):List[A]
Query[E,U].filter[T](f: E => T)(implicit wt: CanBeQueryCondition[T]):Query[E,U]
List.filter evaluates to a list of As, while Query.filter evaluates into a new Query!
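The same contrast can be sketched in Python with a toy query builder. The classes Column, Condition and Query below are hypothetical stand-ins written for this illustration; they are not Slick's API, only a picture of "the predicate returns a description instead of a Boolean".

```python
class Condition:
    """A piece of SQL text describing a comparison."""

    def __init__(self, sql):
        self.sql = sql


class Column:
    """A toy column: comparing it builds SQL text instead of a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Condition(f'"{self.name}" = {value!r}')


class Query:
    def __init__(self, table, where=None):
        self.table = table
        self.where = where

    def filter(self, predicate):
        # The predicate runs ONCE, on a symbolic row, to obtain a Condition.
        row = {"id": Column("id")}
        return Query(self.table, predicate(row))

    def to_sql(self):
        sql = f'select * from "{self.table}"'
        if self.where is not None:
            sql += f" where {self.where.sql}"
        return sql


# List-style filter: the predicate runs per element and returns a bool.
print([x for x in [1, 2, 3] if x == 2])

# Query-style filter: the predicate returns a description, giving a new Query.
q = Query("Users").filter(lambda row: row["id"] == 2)
print(q.to_sql())
```

A plain Boolean predicate such as foo gives the builder nothing to translate, which is why the generated SQL ended up with a constant false.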
Now, a step towards a solution.
It seems that what you want is actually the in operator of SQL. The in operator returns true if there is an element in a list, eg: 4 in (1,2,3,4) is true. Notice that (1,2,3,4) is a SQL list, not Tuple like in Scala.
For this use case of the in SQL operator Slick uses the operator inSet.
Now comes the second part of the problem. (I renamed the where variable to list, because where is a Slick method)
You could try:
val q = for {
  x <- Users if (x.userId, x.userName) inSet list
} yield x
But that won't compile! That's because SQL doesn't have Tuples the way Scala has. In SQL you can't do (1,"Alfred") in ((1,"Alfred"),(2, "Mary")) (remember, the (x,y,z) is the SQL syntax for lists, I am abusing the syntax here only to show that it's invalid -- also there are many dialects of SQL out there, it is possible some of them do support tuples and lists in a similar way.)
One possible solution is to use only the userId field:
val q = for {
  x <- Users if x.userId inSet list2
} yield x
This generates select x2."UserId", x2."UserName" from "Users" x2 where x2."UserId" in (1, 2, 3)
But since you are explicitly using user id and user name, it's reasonable to assume that user id doesn't uniquely identify a user. So, to amend that we can concatenate both values. Of course, we need to do the same in the list.
val list2 = list map { t => t._1 + t._2 }
val q2 = for {
  x <- Users if (x.userId.asColumnOf[String] ++ x.userName) inSet list2
} yield x
Look at the generated SQL:
select x2."UserId", x2."UserName" from "Users" x2
where (cast(x2."UserId" as VARCHAR)||x2."UserName") in ('1a', '3c', '2b')
See the above ||? It's the string concatenation operator used in H2Db. H2Db is the Slick driver I am using to run your example. This query and the others can vary slightly depending on the database you are using.
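One caveat with plain concatenation is worth checking before relying on it: two different (id, name) pairs can produce the same string. The sketch below shows this on made-up pairs. The | separator is an assumption introduced here for illustration, and the same separator would have to be added on the SQL side as well.

```python
# Two different (userId, userName) pairs, chosen to collide.
pairs = [(1, "1a"), (11, "a")]

# Plain concatenation, as in the list2 mapping above.
naive = [str(uid) + name for uid, name in pairs]
print(naive)  # ['11a', '11a']

# A separator that cannot occur in an integer keeps the pairs distinct.
separated = [f"{uid}|{name}" for uid, name in pairs]
print(separated)  # ['1|1a', '11|a']
```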
Hope that it clarifies how Slick works and solves your problem. At least the first one. :)
Part I:
Slick uses Column[...]-types instead of ordinary Scala types. You also need to define Slick helper functions using Column types. You could implement foo like this:
def foo(columns: (Column[Int], Column[String]), values: List[(Int, String)]): Column[Boolean] =
  values
    .map(value => columns._1 === value._1 && columns._2 === value._2)
    .reduce(_ || _)
Also read pedrofurla's answer to better understand how Slick works.
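The condition that foo builds is an OR over one AND pair per value. A rough Python equivalent that produces the WHERE text and its parameters is sketched below. The helper name build_pair_filter and the %s placeholder style are assumptions; the column names are taken from the Users table in the question.

```python
def build_pair_filter(pairs):
    """Return (where_sql, params) matching any (user_id, user_name) pair."""
    if not pairs:
        # Nothing to combine (reduce would fail on an empty list): match no rows.
        return "FALSE", []
    clause = '("UserId" = %s AND "UserName" = %s)'
    where_sql = " OR ".join(clause for _ in pairs)
    # Flatten the pairs in the same order as the placeholders.
    params = [value for pair in pairs for value in pair]
    return where_sql, params


where_sql, params = build_pair_filter([(1, "Alex"), (2, "Mary")])
print(where_sql)
print(params)
```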
Part II:
Method <> is indeed overloaded and when types don't work out the Scala compiler can easily become uncertain which overload it should use. (We should get rid of the overloading in Slick I think.) Writing
def * = userId ~ userName <> (User.tupled _, User.unapply _)
may slightly improve the error message you get. To solve the problem make sure that the Column types of userId and userName exactly correspond to the member types of your User case class, which should look something like case class User( id:Int, name:String ). Also make sure, that you extend Table[User] (not Table[(Int,String)]) when mapping to User.

Datastore Query filtering on list

I want to select all records whose ID is not in a list.
How can I do something like this:
query = Story.all()
query.filter('ID **NOT IN** =', [100,200,..,..])
There's no way to do this efficiently in App Engine. You should simply select everything without that filter, and filter out any matching entities in your code.
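Filtering in code, as suggested above, could look like the sketch below. Plain dicts stand in for the fetched Story entities, and the helper name exclude_ids is an assumption.

```python
def exclude_ids(entities, excluded_ids):
    """Keep only the entities whose ID is not in excluded_ids."""
    excluded = set(excluded_ids)  # set membership is O(1) per lookup
    return [entity for entity in entities if entity["ID"] not in excluded]


# Stand-ins for the result of an unfiltered query.
stories = [{"ID": 100}, {"ID": 150}, {"ID": 200}, {"ID": 250}]
print(exclude_ids(stories, [100, 200]))
```

Every entity is still read from the datastore, so this trades query complexity for read cost.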
This is now supported via GQL query
The 'IN' and '!=' operators in the Python runtime are actually
implemented in the SDK and translate to multiple queries 'under the
hood'.
For example, the query "SELECT * FROM People WHERE name IN ('Bob',
'Jane')" gets translated into two queries, equivalent to running
"SELECT * FROM People WHERE name = 'Bob'" and "SELECT * FROM People
WHERE name = 'Jane'" and merging the results. Combining multiple
disjunctions multiplies the number of queries needed, so the query
"SELECT * FROM People WHERE name IN ('Bob', 'Jane') AND age != 25"
generates a total of four queries, for each of the possible conditions
(age less than or greater than 25, and name is 'Bob' or 'Jane'), then
merges them together into a single result set.
source: appengine blog
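The multiplication described in the quote can be written down directly. This small sketch is based only on that description (each IN contributes one query per value, each != contributes two); the function name is an assumption.

```python
from math import prod


def count_subqueries(in_list_sizes, not_equal_filters=0):
    """Number of underlying queries for a filter with IN and != operators.

    in_list_sizes: the number of values in each IN list.
    not_equal_filters: how many != filters are applied (each one splits
    into a "less than" and a "greater than" query).
    """
    return prod(in_list_sizes) * 2 ** not_equal_filters


# name IN ('Bob', 'Jane') AND age != 25
print(count_subqueries([2], not_equal_filters=1))  # 4
```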
This is an old question, so I'm not sure if the ID is a non-key property. But in order to answer this:
query = Story.all()
query.filter('ID **NOT IN** =', [100,200,..,..])
...With ndb models, you can definitely query for items that are in a list. For example, see the docs here for IN and !=. Here's how to write the IN form of the OP's filter:
query = Story.query().filter(Story.id.IN([100,200,..,..]))
We can even query for items that are in a list of repeated keys:
def all(user_id):
    # See if my user_id is associated with any Group.
    groups_belonged_to = Group.query().filter(Group.members == user_id)
    print([group.to_dict() for group in groups_belonged_to])
Some caveats:
There are docs out there that mention that in order to perform these types of queries, Datastore performs multiple queries behind the scenes, which (1) might take a while to execute, (2) take longer if you are searching in repeated properties, and (3) will up your costs with more operations.

Hitting the 2100 parameter limit (SQL Server) when using Contains()

from f in CUSTOMERS
where depts.Contains(f.DEPT_ID)
select f.NAME
depts is a list (IEnumerable<int>) of department ids
This query works fine until you pass a large list (say around 3000 dept ids) .. then I get this error:
The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Too many parameters were provided in this RPC request. The maximum is 2100.
I changed my query to:
var dept_ids = string.Join(" ", depts.ToStringArray());
from f in CUSTOMERS
where dept_ids.IndexOf(Convert.ToString(f.DEPT_id)) != -1
select f.NAME
Using IndexOf() fixed the error but made the query slow. Is there any other way to solve this? Thanks so much.
My solution (Guids is a list of ids you would like to filter by):
List<MyTestEntity> result = new List<MyTestEntity>();
for(int i = 0; i < Math.Ceiling((double)Guids.Count / 2000); i++)
{
    var nextGuids = Guids.Skip(i * 2000).Take(2000);
    result.AddRange(db.Tests.Where(x => nextGuids.Contains(x.Id)));
}
this.DataContext = result;
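The batching idea is independent of LINQ. A language-neutral sketch in Python, with a hypothetical chunked helper and the batch size of 2000 used above to stay under the 2100 parameter limit:

```python
def chunked(items, size=2000):
    """Yield successive slices of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = list(range(4500))
batches = list(chunked(ids))
print([len(batch) for batch in batches])  # [2000, 2000, 500]
# Each batch would then be passed to one IN (...) query and the results merged.
```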
Why not write the query in sql and attach your entity?
It's been awhile since I worked in Linq, but here goes:
IQuery q = Session.CreateQuery(@"
    select *
    from customerTable f
    where f.DEPT_id in (" + string.Join(",", depts.ToStringArray()) + ")");
q.AttachEntity(CUSTOMER);
Of course, you will need to protect against injection, but that shouldn't be too hard.
You will want to check out the LINQKit project, since somewhere in there is a technique for batching up such statements to solve this issue. I believe the idea is to use the PredicateBuilder to break the local collection into smaller chunks, but I haven't reviewed the solution in detail because I've instead been looking for a more natural way to handle this.
Unfortunately it appears from Microsoft's response to my suggestion to fix this behavior that there are no plans set to have this addressed for .NET Framework 4.0 or even subsequent service packs.
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=475984
UPDATE:
I've opened up some discussion regarding whether this was going to be fixed for LINQ to SQL or the ADO.NET Entity Framework on the MSDN forums. Please see these posts for more information regarding these topics and to see the temporary workaround that I've come up with using XML and a SQL UDF.
I had a similar problem, and I found two ways to fix it:
1) Intersect method
2) Join on IDs
To get values that are NOT in the list, I used the Except method OR a left join.
Update
EntityFramework 6.2 runs the following query successfully:
var employeeIDs = Enumerable.Range(3, 5000);
var orders =
    from order in Orders
    where employeeIDs.Contains((int)order.EmployeeID)
    select order;
Your post was from a while ago, but perhaps someone will benefit from this. Entity Framework does a lot of query caching, and every time you send in a different parameter count, another entry gets added to the cache. Using a "Contains" call will cause SQL to be generated with a clause like "WHERE x IN (@p1, @p2.... @pn)", which bloats the EF cache.
Recently I looked for a new way to handle this, and I found that you can create an entire table of data as a parameter. Here's how to do it:
First, you'll need to create a custom table type, so run this in SQL Server (in my case I called the custom type "TableId"):
CREATE TYPE [dbo].[TableId] AS TABLE(
    Id [int] PRIMARY KEY
)
Then, in C#, you can create a DataTable and load it into a structured parameter that matches the type. You can add as many data rows as you want:
DataTable dt = new DataTable();
dt.Columns.Add("id", typeof(int));
This is an arbitrary list of IDs to search on. You can make the list as large as you want:
dt.Rows.Add(24262);
dt.Rows.Add(24267);
dt.Rows.Add(24264);
Create an SqlParameter using the custom table type and your data table:
SqlParameter tableParameter = new SqlParameter("@id", SqlDbType.Structured);
tableParameter.TypeName = "dbo.TableId";
tableParameter.Value = dt;
Then you can call a bit of SQL from your context that joins your existing table to the values from your table parameter. This will give you all records that match your ID list:
var items = context.Dailies.FromSqlRaw<Dailies>("SELECT * FROM dbo.Dailies d INNER JOIN @id id ON d.Daily_ID = id.id", tableParameter).AsNoTracking().ToList();
You could always partition your list of depts into smaller sets before you pass them as parameters to the IN statement generated by Linq. See here:
Divide a large IEnumerable into smaller IEnumerable of a fix amount of item
