Work around the SQLite parameter limit in a select query - peewee

I have a GUI application with a list of people which contains the person's database id and their attributes. Something like this:
+----+------+
| ID | Name |
+----+------+
|  1 | John |
|  2 | Fred |
|  3 | Mary |
[...]
This list can be filtered, so the number and selection of people shown varies from time to time. To get a list of Peewee Person objects, I first get the list of visible IDs and then use the following query:
ids = [row[0] for row in store]
Person.select().where(Person.id.in_(ids))
Which in turn translates to the following SQL:
('SELECT "t1"."id", "t1"."name" FROM "person" AS "t1" WHERE ("t1"."id" IN (?, ?, ?, ...))', [1, 2, 3, ...])
This raises OperationalError: too many SQL variables on Windows when there are more than 1000 people. This is documented in the Peewee and SQLite docs. Workarounds given online usually relate to bulk inserts and ways to split the action into chunks. Is there any way to work around this limitation with the mentioned SELECT ... WHERE ... IN query?
Getting the separate objects in a list comprehension is too slow:
people = [Person.get_by_id(row[0]) for row in store]
Maybe split the list of IDs into chunks of at most 1000 items, run the select query on each chunk, and then combine the results somehow?

Where are the IDs coming from? The best answer is to avoid using that many parameters, of course. For example, if your list of IDs could be represented as a query of some sort, then you can just write a subquery, e.g.
my_friends = (Relationship
              .select(Relationship.to_user)
              .where(Relationship.from_user == me))
tweets_by_friends = Tweet.select().where(Tweet.user.in_(my_friends))
In the above, we could get all the user IDs from the first query and pass them en-masse as a list into the second query. But since the first query ("all my friends") is itself a query, we can just compose them. You could also use a JOIN instead of a subquery, but hopefully you get the point.
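To make the subquery-versus-JOIN point concrete without Peewee, here is a minimal sketch using only the standard-library sqlite3 module. The relationship and tweet tables, their columns, and the sample rows are assumptions made up for the illustration. Either form binds a single parameter no matter how many friends there are.

```python
import sqlite3

# Hypothetical schema loosely mirroring the Relationship/Tweet example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relationship (from_user INTEGER, to_user INTEGER);
    CREATE TABLE tweet (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT);
    INSERT INTO relationship VALUES (1, 2), (1, 3), (2, 3);
    INSERT INTO tweet (user_id, content) VALUES (2, 'a'), (3, 'b'), (4, 'c');
""")

def tweets_by_friends_subquery(conn, me):
    # The list of friend IDs never leaves the database.
    sql = """
        SELECT t.id, t.user_id, t.content
        FROM tweet AS t
        WHERE t.user_id IN (SELECT r.to_user
                            FROM relationship AS r
                            WHERE r.from_user = ?)
        ORDER BY t.id
    """
    return conn.execute(sql, (me,)).fetchall()

def tweets_by_friends_join(conn, me):
    # Same result via a JOIN, as long as each friendship row is unique.
    sql = """
        SELECT t.id, t.user_id, t.content
        FROM tweet AS t
        JOIN relationship AS r ON r.to_user = t.user_id
        WHERE r.from_user = ?
        ORDER BY t.id
    """
    return conn.execute(sql, (me,)).fetchall()

via_subquery = tweets_by_friends_subquery(conn, 1)
via_join = tweets_by_friends_join(conn, 1)
```

Note that the JOIN form would return a tweet twice if the same friendship were stored twice, which the IN form would not.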
If this is not possible and you seriously have a list of >1000 IDs...how is such a list useful in a GUI application? Over 1000 anything is quite a lot of things.
To try and answer the question you asked -- you'll have to chunk them up. Which is fine. Just:
def get_users(user_ids):
    accum = []
    # 100 at a time.
    for i in range(0, len(user_ids), 100):
        query = User.select().where(User.id.in_(user_ids[i:i+100]))
        accum.extend(query)
    return accum
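The same chunking idea can be tried without Peewee. The sketch below uses the standard-library sqlite3 module with a hypothetical person table. The chunk size of 500 is an arbitrary assumption chosen to stay under the default limit of 999 variables that SQLite builds before 3.32.0 use (newer builds default to 32766).

```python
import sqlite3

def select_people_chunked(conn, ids, chunk_size=500):
    """Run one SELECT ... IN (...) per chunk and concatenate the rows."""
    people = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # One "?" placeholder per ID in this chunk.
        placeholders = ", ".join("?" * len(chunk))
        sql = f"SELECT id, name FROM person WHERE id IN ({placeholders})"
        people.extend(conn.execute(sql, chunk).fetchall())
    return people

# Throwaway in-memory data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(i, f"person {i}") for i in range(1, 2501)],
)

ids = list(range(1, 2501))
people = select_people_chunked(conn, ids)
```

The rows come back chunk by chunk, so if the display order matters, sort the combined list afterwards.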
But seriously, I think there's a problem with the way you're implementing this that makes it even necessary to filter on so many ids.

Related

Nested FutureBuilder vs nested calls for lazy loading from database

I need to choose the better of two approaches that I can follow.
I have a Flutter app that uses sqflite to save data. Inside the database I have two tables:
Employee:
+-------------+-----------------+------+
| employee_id | employee_name |dep_id|
+-------------+-----------------+------+
| e12 | Ada Lovelace | dep1 |
+-------------+-----------------+------+
| e22 | Albert Einstein | dep2 |
+-------------+-----------------+------+
| e82 | Grace Hopper | dep3 |
+-------------+-----------------+------+
SQL:
CREATE TABLE Employee(
employee_id TEXT NOT NULL PRIMARY KEY,
employee_name TEXT NOT NULL ,
dep_id TEXT,
FOREIGN KEY(dep_id) REFERENCES Department(dep_id)
ON DELETE SET NULL
);
Department:
+--------+-----------+-------+
| dep_id | dep_title |dep_num|
+--------+-----------+-------+
| dep1 | Math | dep1 |
+--------+-----------+-------+
| dep2 | Physics | dep2 |
+--------+-----------+-------+
| dep3 | Computer | dep3 |
+--------+-----------+-------+
SQL:
CREATE TABLE Department(
dep_id TEXT NOT NULL PRIMARY KEY,
dep_title TEXT NOT NULL ,
dep_num INTEGER
);
I need to show a ListGrid of the departments that are referenced in the Employee table. I look at the Employee table and fetch the department ids from it. This is easy, but after fetching each dep_id I need to make a card from it, so I need information from the Department table.
The complete information for the ids I fetched from the Employee table is inside the Department table.
There are thousands of rows in each table.
I have a database helper class to connect to the database. DbHelper is something like this:
Future<List<String>> getDepartmentIds() async {
  // fetch all dep_id from Employee table
}

Future<Department> getDepartment(String id) async {
  // fetch Department from Department table for a specific id
}

Future<List<Department>> getEmployeeDepartments() async {
  // 1. fetch all dep_id from Employee table
  // 2. for each id, fetch the Department record from Department table
  var ids = await getDepartmentIds();
  List<Department> deps = [];
  // A plain for loop awaits each lookup; forEach with an async callback would not.
  for (final id in ids) {
    deps.add(await getDepartment(id));
  }
  return deps;
}
There are two approaches:
First one:
Define a function in dbHelper that returns all dep_id values from the Employee table (getDepartmentIds) and another function that returns a department object (model) for a specific id (getDepartment).
Now I need two FutureBuilders nested inside each other, one for fetching the ids and the other for fetching the department model.
Second one:
Define a function that first fetches the ids and then, inside that function, maps each id to a department model (getEmployeeDepartments).
So I need only one FutureBuilder.
Which one is better?
Should I let the FutureBuilders handle it, or should I put the pressure on dbHelper to handle it?
If I use the first approach, then (as far as I can imagine) I have to put the second future call (the one that fetches the Department object by its id, getDepartment) in the build function, and it's recommended not to do so.
And the problem with the second one is that it does a lot of nested calls in dbHelper.
I used ListView.builder for performance.
I checked both with some data but couldn't figure out which one is better. I guess it depends both on flutter and sqlite(sqflite).
Which one is better, or is there a better approach?
Given that I don't see too much code on this example, I'll do a high-level answer on your questions.
Evaluate Approach One
Right off the bat this part sticks out: "returns all dep_id from Employee table"
I would say scratch that, since "return all" is typically never a good solution, especially since you mention your tables have a lot of rows.
Evaluate Approach Two
I'm not sure what difference in performance this has compared to the first approach; it also seems bad for the same reasons. I think this one just changes your UI logic a bit, is all.
Typical 'Endless' List Approach
You would do a query on the Employees table with a join to the Departments table.
You would implement Pagination on your UI and pass in your values to the query from step one.
At a basic level you'll need these variables: Take, Skip, HasMore
Take: The count # of items to request each query
Skip: The count # of items to skip on the next query, this will be the size of the number of items you currently have in your List in memory driving your UI.
HasMore: You can set this on the response of each query, to let the UI know if there are still more items or not.
As you scroll down the list, when you get to the bottom, you will request more items .
Initially issue a query for example: Take: 10, Skip: 0
Next query when you hit the bottom of the UI: Take: 10, Skip: 10
etc..
Example sql query:
SELECT *
FROM Employee E
JOIN Department D ON D.dep_id = E.dep_id
ORDER BY E.employee_name
LIMIT {TAKE#} OFFSET {SKIP#}
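To show how Take, Skip and HasMore fit together, here is a rough sketch in Python with the standard-library sqlite3 module (the question uses Dart and sqflite, so treat this as an illustration of the query logic only). The function name fetch_page is made up, and the trick of requesting one extra row to compute HasMore is my own assumption, not something sqflite provides.

```python
import sqlite3

def fetch_page(conn, take, skip):
    """Return (rows, has_more) for one page of employees with departments."""
    sql = """
        SELECT e.employee_id, e.employee_name, d.dep_title
        FROM Employee AS e
        JOIN Department AS d ON d.dep_id = e.dep_id
        ORDER BY e.employee_name, e.employee_id
        LIMIT ? OFFSET ?
    """
    # Ask for one row more than needed; if it arrives, another page exists.
    rows = conn.execute(sql, (take + 1, skip)).fetchall()
    has_more = len(rows) > take
    return rows[:take], has_more

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (dep_id TEXT PRIMARY KEY, dep_title TEXT NOT NULL);
    CREATE TABLE Employee (
        employee_id TEXT PRIMARY KEY,
        employee_name TEXT NOT NULL,
        dep_id TEXT REFERENCES Department(dep_id)
    );
    INSERT INTO Department VALUES
        ('dep1', 'Math'), ('dep2', 'Physics'), ('dep3', 'Computer');
    INSERT INTO Employee VALUES
        ('e12', 'Ada Lovelace', 'dep1'),
        ('e22', 'Albert Einstein', 'dep2'),
        ('e82', 'Grace Hopper', 'dep3');
""")

# First request: Take 2, Skip 0. Next request skips what is already loaded.
first_page, more = fetch_page(conn, take=2, skip=0)
second_page, more_after = fetch_page(conn, take=2, skip=len(first_page))
```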
Hopefully, this helps, I'm not fully sure what you're trying to do actually - in terms of Code.
As far as I can tell, what you're looking to do is get a list of employees with relevant info including department.
If that's the case, then it's tailor made for INNER JOIN. Something like this:
SELECT Employee.*, Department.dep_id, Department.dep_title
FROM Employee INNER JOIN Department
ON Employee.dep_id = Department.dep_id;
(although you may want to double check that, my SQL is a bit rusty).
This would do what you need in one step. However, there is still the issue of what you're asking which seems to be "Is it more efficient to do many small requests or one big one, and what are the performance ramifications".
The answer to that is a bit specific to Flutter. When you make a request with sqflite, it processes whatever you've passed to it and sends it to Java/Objective-C, which possibly does more processing and pushes the work to a background thread. That thread calls into the SQLite library, which does more processing to understand the request and then actually reads the data on disk to perform the operation. The result returns to the Java/Objective-C layer, which pushes the response to the UI thread, which in turn responds back to Dart.
If that doesn't sound particularly efficient, that's because it isn't =D. If you're doing this a few times (or even a few hundred) it's probably fine, but if you're getting into thousands as you state it might start slowing down.
The alternative you've proposed is to do one large request. You will know better than I whether that is wise; if it's a couple thousand but only ever a couple thousand, and the data you're returning is always going to be relatively small (i.e. just a 10-20 character name and department name), then you'll have, say, (20+20)*2000 = 80,000 bytes = 80 KB of data. Even if you assume the overhead will double that size, 160 KB of data shouldn't be enough to faze any relatively recent smartphone (after all, that's much smaller than any single photo!).
Now, taking some domain specific knowledge, you could optimize this. For example, if you know the number of departments is much smaller than employees (i.e. < 100 or something), you could skip the entire issue of doing joins, and simply request all departments before this begins and put it in a map (dep_id => dep_title), and then once you've requested employees you could just simply do that lookup from dep_id to dep_title yourself. That way your requests wouldn't have to include the dep_title over and over again.
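A small sketch of that lookup-map idea, again in Python with sqlite3 rather than Dart, purely to show the shape of it. The helper names are invented for the example, and the schema is a trimmed copy of the one in the question.

```python
import sqlite3

def load_department_titles(conn):
    # One small query up front: dep_id => dep_title.
    return dict(conn.execute("SELECT dep_id, dep_title FROM Department"))

def employees_with_titles(conn, titles):
    # No JOIN here; the title is resolved in memory from the map.
    rows = conn.execute(
        "SELECT employee_name, dep_id FROM Employee ORDER BY employee_name"
    )
    return [(name, titles.get(dep_id)) for name, dep_id in rows]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (dep_id TEXT PRIMARY KEY, dep_title TEXT NOT NULL);
    CREATE TABLE Employee (
        employee_id TEXT PRIMARY KEY,
        employee_name TEXT NOT NULL,
        dep_id TEXT
    );
    INSERT INTO Department VALUES
        ('dep1', 'Math'), ('dep2', 'Physics'), ('dep3', 'Computer');
    INSERT INTO Employee VALUES
        ('e12', 'Ada Lovelace', 'dep1'),
        ('e22', 'Albert Einstein', 'dep2'),
        ('e82', 'Grace Hopper', 'dep3'),
        ('e90', 'No Department', NULL);
""")

titles = load_department_titles(conn)
listing = employees_with_titles(conn, titles)
```

Employees whose dep_id is NULL (possible here because of ON DELETE SET NULL) simply get None as the title instead of being dropped, which an INNER JOIN would do.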
That being said, you may want to consider paging the employee lookup whether or not you use a join. You'd do this by requesting 100 employees (or whatever number) at a time rather than the entire batch - that way you don't have the overhead of 1000+ calls through the stack, but you also don't have a large block of data all in memory all at once.
SELECT * FROM Employee
WHERE employee_name >= LastValue
ORDER BY employee_name
LIMIT 100;
Unfortunately that doesn't fit in as well with how flutter does lists, so you'd probably need to have something like a 'EmployeeDatabaseManager' that does the actual requests, and your list would call into it to get the data. That's probably beyond the scope of this question though.
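For what it's worth, here is how the "continue from the last value" paging could look in Python with sqlite3. This is a sketch under assumptions: the function name is invented, and I've added employee_id as a tie-breaker so that two employees with the same name are neither repeated nor skipped between pages.

```python
import sqlite3

def fetch_after(conn, last_name, last_id, limit):
    """Fetch the next `limit` employees after the (name, id) seen last."""
    if last_name is None:
        # First page: no lower bound yet.
        sql = """
            SELECT employee_id, employee_name FROM Employee
            ORDER BY employee_name, employee_id
            LIMIT ?
        """
        params = (limit,)
    else:
        # Strictly after the last row, comparing name first and id second.
        sql = """
            SELECT employee_id, employee_name FROM Employee
            WHERE employee_name > ?
               OR (employee_name = ? AND employee_id > ?)
            ORDER BY employee_name, employee_id
            LIMIT ?
        """
        params = (last_name, last_name, last_id, limit)
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (employee_id TEXT PRIMARY KEY, employee_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("e1", "Ann"), ("e2", "Ann"), ("e3", "Bob"), ("e4", "Cy"), ("e5", "Dee")],
)

page1 = fetch_after(conn, None, None, 2)
page2 = fetch_after(conn, page1[-1][1], page1[-1][0], 2)
page3 = fetch_after(conn, page2[-1][1], page2[-1][0], 2)
```

Unlike an OFFSET, this keeps each request cheap however far down the list you are, provided there is an index on the ordering columns.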

Check sql table for values in another table

If I have the following data:
Results Table
.[Required]
I want one grape
I want one orange
I want one apple
I want one carrot
I want one watermelon
Fruit Table
.[Name]
grape
orange
apple
What I want to do is essentially say give me all results where users are looking for a fruit. This is all just example, I am looking at a table with roughly 1 million records and a string field of 4000+ characters. I am expecting a somewhat slow result and I know that the table could DEFINITELY be structured better, but I have no control of that. Here is the query I would essentially have, but it doesn't seem to do what I want. It gives every record. And yes, [#Fruit] is a temp table.
SELECT * FROM [Results]
JOIN [#Fruit] ON
'%'+[Results].[Required]+'%' LIKE [#Fruit].[Name]
Ideally my output should be the following 3 rows:
I want one grape
I want one orange
I want one apple
If that kind of thing is doable, I would try the other way round:
SELECT * FROM [Results]
JOIN [#Fruit] ON
[Results].[Required] LIKE '%'+[#Fruit].[Name]+'%'
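A quick way to convince yourself of the direction of the comparison is to try it on the sample data. The sketch below uses Python's sqlite3 module, so it is not SQL Server: string concatenation is || instead of +, the table names are lower-cased stand-ins, and SQLite's LIKE is case-insensitive for ASCII by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE results (required TEXT);
    CREATE TABLE fruit (name TEXT);
    INSERT INTO results VALUES
        ('I want one grape'), ('I want one orange'), ('I want one apple'),
        ('I want one carrot'), ('I want one watermelon');
    INSERT INTO fruit VALUES ('grape'), ('orange'), ('apple');
""")

# The long text is the value being tested; the fruit name, wrapped in
# wildcards, is the pattern.
rows = conn.execute("""
    SELECT r.required
    FROM results AS r
    JOIN fruit AS f ON r.required LIKE '%' || f.name || '%'
    ORDER BY r.rowid
""").fetchall()
matches = [row[0] for row in rows]
```

Keep in mind this is plain substring matching, so a fruit named grape would also match text containing grapefruit.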
This topic interests me, so I did a little bit of searching.
Suggestion 1 : Full Text Search
I think what you are trying to do is Full Text Search.
You will need a Full-Text Index created on the table if it is not already there (CREATE FULLTEXT INDEX).
This should be faster than performing "Like".
Suggestion 2 : Meta Data Search
Another approach I'd take is to create a metadata table and maintain the information myself whenever the [Results].[Required] values are created or updated.
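Roughly what I mean by a metadata table, sketched in Python with sqlite3. Everything here is hypothetical: the result_fruit table, the add_result helper and the hard-coded FRUITS list are invented for the illustration, and it assumes you are able to hook into the code path that writes the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE results (id INTEGER PRIMARY KEY, required TEXT);
    -- Hypothetical metadata table: one row per (result, fruit mentioned).
    CREATE TABLE result_fruit (result_id INTEGER, fruit TEXT);
    CREATE INDEX idx_result_fruit ON result_fruit (fruit);
""")

FRUITS = ["grape", "orange", "apple"]

def add_result(conn, required):
    """Insert a result and record which fruits it mentions, at write time."""
    cur = conn.execute("INSERT INTO results (required) VALUES (?)", (required,))
    result_id = cur.lastrowid
    lowered = required.lower()
    conn.executemany(
        "INSERT INTO result_fruit (result_id, fruit) VALUES (?, ?)",
        [(result_id, fruit) for fruit in FRUITS if fruit in lowered],
    )
    return result_id

for text in ["I want one grape", "I want one orange", "I want one apple",
             "I want one carrot", "I want one watermelon"]:
    add_result(conn, text)

# Reading is now an indexed join instead of a scan with LIKE.
fruit_requests = [row[1] for row in conn.execute("""
    SELECT DISTINCT r.id, r.required
    FROM results AS r
    JOIN result_fruit AS rf ON rf.result_id = r.id
    ORDER BY r.id
""")]
```

The expensive string scan happens once per write rather than once per query over a million rows.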
This looks more or less doable, but I'd start from the Fruit table just for conceptual clarity.
Here's roughly how I would structure this, ignoring all performance / speed / normalization issues (note also that I've switched around the variables in the LIKE comparison):
SELECT f.name, r.required
FROM fruits f
JOIN results r ON r.required LIKE CONCAT('%', f.name, '%')
...and perhaps add a LIMIT 10 to keep the query from wasting time while you're testing it out.
This structure will:
give you one record per "match" (per Result row that matches a Fruit)
exclude Result rows that don't have a Fruit
probably be ungodly slow.
Good luck!

Query datastore for the set of property values present

I have a property column which can have a subset of the following values at any point in time: { a | b | c | d | e }. By this, I mean that sometimes it can be any of { a | d | e }, or at another time it can even be { x | y | z }. How do I query the datastore, so that I can find out what subset is present at that point in time, without having to dig into each entity?
Presently I'm doing it this way:
people = Person.all().fetch(100)
city = set()
for p in people:
    city.add(p.address)
I want to get the set of property values that are present at this point in time (i.e. no duplicates). For example, at one point in time all 5,000,000 people have an address of { Manila | Cebu | Davao }, then I want the set(Manila, Cebu, Davao).
At another point in time, all 5,000,000 people will have an address of { Iloilo | Laoag }, then I want the set(Iloilo, Laoag).
Before any query, I would not know what the set should compose of.
My present method requires that I dig through all the entities. It's terribly inefficient, any better way?
In AppEngine, it's almost always better to generate and store what you might need, during write time.
So in your use case, every time you add or edit a person entity, you add the city they are in to another model that lists all the cities, and then store that cities entity as well.
class Cities(db.Model):
    list_of_cities = db.TextProperty(default="[]")  # we'll use a stringified JSON list of cities

# when creating a new person / or when editing
person = Person(city=city)
cities = Cities.all().get()  # there's only one model that we'll use.
list_of_cities = simplejson.loads(cities.list_of_cities)
if city not in list_of_cities:
    list_of_cities.append(city)  # add to the list of cities
    cities.list_of_cities = simplejson.dumps(list_of_cities)
    db.put(cities)
person.put()
You may want to use memcache on your cities entity to speed things up a bit. If you are also expecting to add more than one person in bursts of more than 1 write / second, then you might need to also consider sharding your list of cities.
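One thing the plain list cannot do is forget a city once nobody lives there any more. A possible variation, sketched here in plain Python with no datastore calls at all, is to keep a count per city and drop the entry when it reaches zero. The CityIndex class is a hypothetical in-memory stand-in for the single Cities entity; persisting it is left out.

```python
from collections import Counter

class CityIndex:
    """In-memory stand-in for the single 'Cities' entity (hypothetical)."""

    def __init__(self):
        self._counts = Counter()  # city -> number of people living there

    def add(self, city):
        # Call when a person is created with this city.
        self._counts[city] += 1

    def remove(self, city):
        # Call when a person is deleted or leaves this city.
        self._counts[city] -= 1
        if self._counts[city] <= 0:
            del self._counts[city]

    def move(self, old_city, new_city):
        # Call when a person's city is edited.
        self.remove(old_city)
        self.add(new_city)

    def cities(self):
        # The set of values currently present, with no entity scan.
        return set(self._counts)

index = CityIndex()
for city in ["Manila", "Manila", "Cebu"]:
    index.add(city)
before = index.cities()

index.move("Manila", "Davao")
index.move("Manila", "Davao")
after = index.cities()
```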
An alternative to the approach suggested by Albert is to compute these values periodically using a mapreduce. The App Engine Mapreduce library makes this fairly straightforward. Your mapper will output the city (for instance) for each record, while the reducer will output the value and the number of times it occurs for each.
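To illustrate the shape of the mapper and reducer, here is a tiny in-process imitation in plain Python. It does not use the App Engine Mapreduce library or its API; the function names and the dict-based person records are assumptions for the sketch.

```python
from collections import defaultdict

def mapper(person):
    # Emit one (city, 1) pair per record.
    yield person["address"], 1

def reducer(city, counts):
    # Collapse all pairs for one city into (city, total).
    return city, sum(counts)

def run_mapreduce(people):
    # Shuffle step: group the mapper output by key.
    grouped = defaultdict(list)
    for person in people:
        for key, value in mapper(person):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

people = [
    {"name": "A", "address": "Manila"},
    {"name": "B", "address": "Cebu"},
    {"name": "C", "address": "Manila"},
    {"name": "D", "address": "Davao"},
]
city_counts = run_mapreduce(people)
present_cities = set(city_counts)
```

The keys of the reducer output are the set of values currently present, and the counts come along for free.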

Retrieving data from 2 tables that have a 1 to many relationship - more efficient with 1 query or 2?

I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.
Table A is a list of events:
Id | TimeStamp | EventTypeId
--------------------------------
1 | 10:26... | 12
2 | 11:31... | 13
3 | 14:56... | 12
Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:
EventId | Property | Value
------------------------------
1 | 1 | dog
1 | 2 | cat
3 | 1 | mazda
3 | 2 | honda
3 | 3 | toyota
There are a number of conditions that I will apply when I retrieve the data, however they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.
I believe I have two options for retrieving the data:
Option 1
Perform two queries: first query table A (with a WHERE clause) and store data somewhere, then query table B (joining on table A in order to use same WHERE clause) and "fill in the blanks" in the data that I retrieved from table A.
This option requires SQL Server to perform 2 searches through table A, however the resulting 2 data sets contain no duplicate data.
Option 2
Perform a single query, joining table A to table B with a LEFT JOIN.
This option only requires one search of table A but the resulting data set will contain many duplicated values.
Conclusion
Is there a "correct" way to do this or do I need to try both ways and see which one is quicker?
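For reference, the two options can be sketched like this. The example uses Python's sqlite3 module rather than SQL Server, and the table names event and event_property are stand-ins for Table A and Table B, so it only shows the shape of each option and says nothing about which is faster on a real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (id INTEGER PRIMARY KEY, ts TEXT, event_type_id INTEGER);
    CREATE TABLE event_property (event_id INTEGER, property INTEGER, value TEXT);
    INSERT INTO event VALUES (1, '10:26', 12), (2, '11:31', 13), (3, '14:56', 12);
    INSERT INTO event_property VALUES
        (1, 1, 'dog'), (1, 2, 'cat'),
        (3, 1, 'mazda'), (3, 2, 'honda'), (3, 3, 'toyota');
""")

def load_two_queries(conn, event_type_id):
    # Option 1: events first, then properties filtered by the same condition.
    events = {row[0]: {} for row in conn.execute(
        "SELECT id FROM event WHERE event_type_id = ?", (event_type_id,))}
    props = conn.execute("""
        SELECT p.event_id, p.property, p.value
        FROM event_property AS p
        JOIN event AS e ON e.id = p.event_id
        WHERE e.event_type_id = ?
    """, (event_type_id,))
    for event_id, prop, value in props:
        events[event_id][prop] = value
    return events

def load_one_query(conn, event_type_id):
    # Option 2: a single LEFT JOIN; event columns repeat once per property.
    events = {}
    rows = conn.execute("""
        SELECT e.id, p.property, p.value
        FROM event AS e
        LEFT JOIN event_property AS p ON p.event_id = e.id
        WHERE e.event_type_id = ?
    """, (event_type_id,))
    for event_id, prop, value in rows:
        props = events.setdefault(event_id, {})
        if prop is not None:  # events without properties yield one NULL row
            props[prop] = value
    return events

type_12_two = load_two_queries(conn, 12)
type_12_one = load_one_query(conn, 12)
```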
Ex
Select E.Id,E.Name from Employee E join Dept D on E.DeptId=D.Id
and a subquery something like this -
Select E.Id,E.Name from Employee E Where E.DeptId in (Select Id from Dept)
When I consider performance which of the two queries would be faster and why ?
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
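As a toy version of "run them both and measure", here is a sketch that times the two query shapes in Python against an in-memory SQLite database. SQLite is only a stand-in; the timings will not transfer to SQL Server, where IO statistics and the actual execution plan are the tools to use. The table sizes and the timed helper are made up for the example.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dept (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Name TEXT, DeptId INTEGER);
    CREATE INDEX idx_employee_dept ON Employee (DeptId);
""")
conn.executemany("INSERT INTO Dept VALUES (?, ?)",
                 [(i, f"dept {i}") for i in range(1, 51)])
# DeptId runs from 0 to 59, so some employees have no matching department.
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [(i, f"emp {i}", i % 60) for i in range(1, 5001)])

JOIN_SQL = """
    SELECT E.Id, E.Name FROM Employee E
    JOIN Dept D ON E.DeptId = D.Id
    ORDER BY E.Id
"""
IN_SQL = """
    SELECT E.Id, E.Name FROM Employee E
    WHERE E.DeptId IN (SELECT Id FROM Dept)
    ORDER BY E.Id
"""

def timed(conn, sql, repeats=20):
    """Return the rows and the average wall-clock seconds per run."""
    start = time.perf_counter()
    for _ in range(repeats):
        rows = conn.execute(sql).fetchall()
    return rows, (time.perf_counter() - start) / repeats

join_rows, join_seconds = timed(conn, JOIN_SQL)
in_rows, in_seconds = timed(conn, IN_SQL)
print(f"JOIN: {join_seconds:.6f}s  IN: {in_seconds:.6f}s")
```

Whatever the numbers say, check first that both queries return the same rows, otherwise the comparison is meaningless.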

An Efficient Lookup Table in redis--implemented using redis sets?

I want to use redis to store a large set of user_ids and with each of these
ids, a "group id" to which that user was previously assigned:
User_ID | Group_ID
1043 | 2
2403 | 1
The number of user_ids is fairly large (~ 10 million); the number of unique
group ids is about 3 - 5.
My purpose for this LuT is routine:
find the group id for a given user; and
return a list of other users (of specified length) with the same
group id as that given user
There might be an idiomatic way to do this in redis or at least a way that's most efficient. If so i would like to know what it is. Here's a simplified version of my working implementation (using the python client):
# assume a redis server is already running
# create some model data:
import numpy as NP
NUM_REG_USERS = 100
user_id = NP.random.randint(1000, 9999, NUM_REG_USERS)
cluster_id = NP.random.randint(1, 4, NUM_REG_USERS)
D = zip(cluster_id, user_id)
from redis import Redis
r = Redis()
# populate the redis LuT:
for t in D:
    r.sadd(int(t[0]), int(t[1]))  # plain ints; the client may reject NumPy integer types
# the queries:
# is user_id 1034 in Group 1?
r.sismember("1", 1034)
# return 10 users in the same Group 1 as user_id 1034:
list(r.smembers("1"))[:10] # assume user_id 1034 is in group 1
So i have implemented this LuT using ordinary redis sets; each set is keyed to a Group ID (1, 2, or 3), so there are three sets in total.
Is this the most efficient way store this data given the type of queries i want to run against it?
Using sets is a good basic approach, though there are a couple of things in there you may want to change:
Unless you store the group ID for each user somewhere, you will need up to 5 round trips (one SISMEMBER per group) to get the group for a particular user. Each operation is O(1), but you still need to consider latency. Usually this is fairly easy to avoid: you probably have lots of other properties stored for each user, so it is trivial to add one for the group ID.
You probably want SRANDMEMBER rather than SMEMBERS. SMEMBERS returns the entire set, and I think slicing it will give you the same 10 items from your million-item set every time.
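To show the two structures side by side, here is a plain-Python model of the layout I'm suggesting. It does not talk to Redis at all: the dict stands in for a per-user group ID property, the sets stand in for the per-group Redis sets, and random.sample plays the role SRANDMEMBER with a count would play on the server. The class and method names are invented.

```python
import random

class GroupLookup:
    """Pure-Python model of the suggested layout (not a Redis client)."""

    def __init__(self):
        self.group_of = {}  # user_id -> group_id, a single lookup per user
        self.members = {}   # group_id -> set of user_ids, one set per group

    def assign(self, user_id, group_id):
        old = self.group_of.get(user_id)
        if old is not None:
            self.members[old].discard(user_id)
        self.group_of[user_id] = group_id
        self.members.setdefault(group_id, set()).add(user_id)

    def group_for(self, user_id):
        # One lookup instead of probing every group's set in turn.
        return self.group_of.get(user_id)

    def peers(self, user_id, count):
        # Random members of the same group, excluding the user itself.
        # Copying the set is fine for a sketch, not for 10 million users.
        others = self.members[self.group_of[user_id]] - {user_id}
        return random.sample(sorted(others), min(count, len(others)))

lut = GroupLookup()
for uid, gid in [(1043, 2), (2403, 1), (3001, 2), (3002, 2), (3003, 1)]:
    lut.assign(uid, gid)

group = lut.group_for(1043)
sample = lut.peers(1043, 10)
```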
