An Efficient Lookup Table in Redis, Implemented Using Redis Sets?

I want to use redis to store a large set of user_ids and with each of these
ids, a "group id" to which that user was previously assigned:
User_ID | Group_ID
--------+---------
1043    | 2
2403    | 1
The number of user_ids is fairly large (~ 10 million); the number of unique
group ids is about 3 - 5.
My purpose for this LuT is routine:
- find the group id for a given user; and
- return a list of other users (of specified length) with the same group id as that given user
There might be an idiomatic way to do this in Redis, or at least a way that's more efficient than mine. If so, I would like to know what it is. Here's a simplified version of my working implementation (using the Python client):
# assume a redis server is already running
# create some model data:
import numpy as np
NUM_REG_USERS = 100
user_id = np.random.randint(1000, 9999, NUM_REG_USERS)
cluster_id = np.random.randint(1, 4, NUM_REG_USERS)
D = zip(cluster_id, user_id)

from redis import Redis
r = Redis()

# populate the redis LuT, one set per group id:
for grp, uid in D:
    r.sadd(grp, uid)

# the queries:
# is user_id 1034 in Group 1?
r.sismember("1", 1034)

# return 10 users in the same Group 1 as user_id 1034
# (SMEMBERS returns a set, which is not sliceable, so convert to a list):
list(r.smembers("1"))[:10]  # assume user_id 1034 is in group 1
So I have implemented this LuT using ordinary Redis sets; each set is keyed to a Group ID (1, 2, or 3), so there are three sets in total.
Is this the most efficient way to store this data, given the type of queries I want to run against it?

Using sets is a good basic approach, though there are a couple of things in there you may want to change:
Unless you store the group ID for each user somewhere, you will need up to five round trips to get the group for a particular user - each SISMEMBER operation is O(1), but you still need to consider latency. Usually this is fairly easy to handle without too much effort - you presumably have lots of other properties stored for each user, so it is trivial to add one for group ID.
You probably want SRANDMEMBER rather than SMEMBERS - SMEMBERS returns the entire set, and slicing it client-side will likely give you the same 10 items from your million-item set every time.
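For concreteness, here is a minimal sketch of both suggestions using redis-py; the key naming scheme (user:<id>:group, group:<id>) is an illustrative choice, not something prescribed by the question:
from redis import Redis

r = Redis()

def assign(user_id, group_id):
    # one key per user gives an O(1), single-round-trip group lookup
    r.set(f"user:{user_id}:group", group_id)
    r.sadd(f"group:{group_id}", user_id)

def group_of(user_id):
    gid = r.get(f"user:{user_id}:group")  # bytes, e.g. b"2"
    return int(gid) if gid is not None else None

def peers_of(user_id, n=10):
    # SRANDMEMBER with a count returns n random members in one call,
    # instead of shipping the whole multi-million-member set like SMEMBERS
    return r.srandmember(f"group:{group_of(user_id)}", n)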

Related

Nested FutureBuilder vs nested calls for lazy loading from database

I need to choose the better of two approaches that I can follow.
I have a Flutter app that uses sqflite to save data. Inside the database I have two tables:
Employee:
+-------------+-----------------+------+
| employee_id | employee_name |dep_id|
+-------------+-----------------+------+
| e12 | Ada Lovelace | dep1 |
+-------------+-----------------+------+
| e22 | Albert Einstein | dep2 |
+-------------+-----------------+------+
| e82 | Grace Hopper | dep3 |
+-------------+-----------------+------+
SQL:
CREATE TABLE Employee(
    employee_id TEXT NOT NULL PRIMARY KEY,
    employee_name TEXT NOT NULL,
    dep_id TEXT,
    FOREIGN KEY(dep_id) REFERENCES Department(dep_id)
        ON DELETE SET NULL
);
Department:
+--------+-----------+---------+
| dep_id | dep_title | dep_num |
+--------+-----------+---------+
| dep1   | Math      | 1       |
+--------+-----------+---------+
| dep2   | Physics   | 2       |
+--------+-----------+---------+
| dep3   | Computer  | 3       |
+--------+-----------+---------+
SQL:
CREATE TABLE Department(
    dep_id TEXT NOT NULL PRIMARY KEY,
    dep_title TEXT NOT NULL,
    dep_num INTEGER
);
I need to show a ListGrid of the departments that are referenced in the Employee table. I should look at the Employee table and fetch the department ids from it. This is easy, but after fetching those dep_ids I need to make a card from each of them, so I need information from the Department table: the complete information for the ids I fetched from the Employee table is inside the Department table.
There are thousands of rows in each table.
I have a database helper class to connect to the database; DbHelper is something like this:
Future<List<String>> getDepartmentIds() async {
  // fetch all dep_id from Employee table
}

Future<Department> getDepartment(String id) async {
  // fetch Department from Department table for a specific id
}

Future<List<Department>> getEmployeeDepartments() async {
  // 1. fetch all dep_id from Employee table
  // 2. for each id fetch the Department record from Department table
  var ids = await getDepartmentIds();
  List<Department> deps = [];
  // a plain loop rather than ids.forEach((id) async {...}), which would
  // fire the futures without awaiting them
  for (var id in ids) {
    deps.add(await getDepartment(id));
  }
  return deps;
}
There are two approaches:
First one:
Define a function in DbHelper that returns all dep_ids from the Employee table (getDepartmentIds), and another function that returns a department object (model) for a specific id (getDepartment).
Now I need two FutureBuilders nested inside each other: one for fetching the ids and the other for fetching the department models.
Second one:
Define a function that first fetches the ids, then inside that function maps each id to a department model (getEmployeeDepartments).
So I need one FutureBuilder.
Which one is better? Should I let the FutureBuilders handle it, or should I put the pressure on DbHelper to handle it?
If I use the first approach then I have to (as far as I can imagine!) put the second future call (the one that fetches the Department object (model) based on its id, getDepartment) in the build function, and it's recommended not to do so.
And the problem with the second one is that it does a lot of nested calls in DbHelper.
I used ListView.builder for performance.
I checked both with some data but couldn't figure out which one is better. I guess it depends on both Flutter and SQLite (sqflite).
Which one is better, or is there a better approach?
Given that I don't see too much code in this example, I'll give a high-level answer to your questions.
Evaluate Approach One
Right off the bat this part sticks out: "returns all dep_id from Employee table"
I would say scratch that; "return all" is almost never a good solution, especially since you mention your tables have a lot of rows.
Evaluate Approach Two
I'm not sure what difference in performance this has compared to the first approach; it seems bad for the same reasons. I think this one just changes your UI logic a bit, is all.
Typical 'Endless' List Approach
You would do a query on the Employees table with a join to the Departments table.
You would implement Pagination on your UI and pass in your values to the query from step one.
At a basic level you'll need these variables: Take, Skip, HasMore
Take: the number of items to request in each query
Skip: the number of items to skip in the next query; this will be the number of items currently in the list in memory driving your UI
HasMore: set this on the response of each query, to let the UI know whether there are still more items
As you scroll down the list, when you get to the bottom, you will request more items.
Initially issue a query, for example: Take: 10, Skip: 0
Next query, when you hit the bottom of the UI: Take: 10, Skip: 10
And so on.
Example SQL query (SQLite LIMIT/OFFSET syntax, matching the question's schema):
SELECT *
FROM Employee E
JOIN Department D ON D.dep_id = E.dep_id
ORDER BY E.employee_name
LIMIT {TAKE#} OFFSET {SKIP#}
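To illustrate the Take/Skip/HasMore bookkeeping, here is a small sketch using Python's sqlite3 against the question's schema (sqflite speaks the same SQLite dialect; fetch_page and the take-plus-one trick for HasMore are just one way to do it):
import sqlite3

TAKE = 10  # page size

def fetch_page(conn, skip, take=TAKE):
    # ask for one extra row so we can tell the UI whether more pages exist
    rows = conn.execute(
        """SELECT E.employee_name, D.dep_title
           FROM Employee E
           JOIN Department D ON D.dep_id = E.dep_id
           ORDER BY E.employee_name
           LIMIT ? OFFSET ?""",
        (take + 1, skip),
    ).fetchall()
    has_more = len(rows) > take
    return rows[:take], has_more

conn = sqlite3.connect("app.db")
page1, more = fetch_page(conn, skip=0)   # first query: Take 10, Skip 0
page2, more = fetch_page(conn, skip=10)  # next query when the UI hits bottom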
Hopefully this helps; I'm not fully sure what you're trying to do, in terms of actual code.
As far as I can tell, what you're looking to do is get a list of employees with relevant info including department.
If that's the case, then it's tailor made for INNER JOIN. Something like this:
SELECT Employee.*, Department.dep_id, Department.dep_title
FROM Employee INNER JOIN Department
ON Employee.dep_id = Department.dep_id;
(although you may want to double check that, my SQL is a bit rusty).
This would do what you need in one step. However, there is still the issue of what you're asking which seems to be "Is it more efficient to do many small requests or one big one, and what are the performance ramifications".
The answer to that is a bit specific to Flutter. When you make a request with sqflite, it processes whatever you've passed to it, sends it to the Java/ObjC layer (possibly with more processing), and pushes the work to a background thread. That thread calls into the SQLite library, which does more processing to understand the request and then actually reads the data on disk to do the operation. The result returns to the Java/ObjC layer, which pushes the response to the UI thread, which in turn responds back to Dart.
If that doesn't sound particularly efficient, that's because it isn't =D. If you're doing this a few times (or even a few hundred) it's probably fine, but if you're getting into thousands as you state it might start slowing down.
The alternative you've proposed is to do one large request. You will know better than I whether that is wise; if it's a couple thousand rows but only ever a couple thousand, and the data you're returning is always going to be relatively small (i.e. just a 10-20 character name and department name), then you'll have say (20+20)*2000 = 80,000 bytes, or about 80 KB of data. Even if you assume the overhead doubles that size, 160 KB of data shouldn't be enough to faze any relatively recent smartphone (after all, that's much smaller than a single photo!).
Now, applying some domain-specific knowledge, you could optimize this. For example, if you know the number of departments is much smaller than the number of employees (i.e. < 100 or so), you could skip the entire issue of doing joins and simply request all departments up front, putting them in a map (dep_id => dep_title); once you've requested employees, you just do the dep_id to dep_title lookup yourself. That way your requests don't have to include the dep_title over and over again.
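A rough sketch of that map optimization, again in Python's sqlite3 for brevity (a Dart Map would play the role of the dict):
import sqlite3

conn = sqlite3.connect("app.db")

# one small up-front query: dep_id -> dep_title for all (< 100) departments
dep_titles = dict(conn.execute("SELECT dep_id, dep_title FROM Department"))

# employee pages no longer need a join; resolve the title locally
for name, dep_id in conn.execute(
        "SELECT employee_name, dep_id FROM Employee LIMIT 100"):
    print(name, dep_titles.get(dep_id, "(no department)"))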
That being said, you may want to consider paging the employee lookup whether or not you use a join. You'd do this by requesting 100 employees (or whatever number) at a time rather than the entire batch - that way you don't have the overhead of 1000+ calls through the stack, but you also don't have a large block of data all in memory all at once.
SELECT * FROM Employee
WHERE employee_name >= LastValue
ORDER BY employee_name
LIMIT 100;
Unfortunately that doesn't fit in as well with how Flutter does lists, so you'd probably need something like an 'EmployeeDatabaseManager' that does the actual requests, with your list calling into it to get the data. That's probably beyond the scope of this question though.

Work around the SQLite parameter limit in a select query

I have a GUI application with a list of people which contains the person's database id and their attributes. Something like this:
+----+------+
| ID | Name |
+----+------+
| 1 | John |
| 2 | Fred |
| 3 | Mary |
[...]
This list can be filtered, so the number and set of people shown varies from time to time. To get a list of Peewee Person objects I first get the list of visible IDs and use the following query:
ids = [row[0] for row in store]
Person.select().where(Person.id.in_(ids))
Which in turn translates to the following SQL:
('SELECT "t1"."id", "t1"."name" FROM "person" AS "t1" WHERE ("t1"."id" IN (?, ?, ?, ...))', [1, 2, 3, ...])
This throws an OperationalError: too many SQL variables error on Windows with more than 1000 people. This is documented in the Peewee and SQLite docs. Workarounds given online usually relate to bulk inserts and ways to split the action in chunks. Is there any way to work around this limitation with the mentioned SELECT ... WHERE ... IN query?
Getting the separate objects in a list comprehension is too slow:
people = [Person.get_by_id(row[0]) for row in store]
Maybe split the list of IDs in max 1000 items, use the select query on each chunk and then combine those somehow?
Where are the IDs coming from? The best answer is to avoid using that many parameters, of course. For example, if your list of IDs could be represented as a query of some sort, then you can just write a subquery, e.g.
my_friends = (Relationship
.select(Relationship.to_user)
.where(Relationship.from_user == me))
tweets_by_friends = Tweet.select().where(Tweet.user.in_(my_friends))
In the above, we could get all the user IDs from the first query and pass them en-masse as a list into the second query. But since the first query ("all my friends") is itself a query, we can just compose them. You could also use a JOIN instead of a subquery, but hopefully you get the point.
If this is not possible and you seriously have a list of >1000 IDs...how is such a list useful in a GUI application? Over 1000 anything is quite a lot of things.
To try and answer the question you asked -- you'll have to chunk them up. Which is fine. Just:
user_ids = list_of_user_ids
accum = []

# 100 at a time.
for i in range(0, len(user_ids), 100):
    query = User.select().where(User.id.in_(user_ids[i:i + 100]))
    accum.extend(query)

return accum
But seriously, I think there's a problem with the way you're implementing this that makes it necessary to filter on so many ids in the first place.

How to model arbitrarily ordering items in database?

I accepted a new feature: re-ordering some items via a drag-and-drop UI and saving each user's preference to the database. What's the best way to do so?
After reading some questions on StackOverflow, I found this solution.
Solution 1: Use decimal numbers to indicate order
For example,
id | item | order
1  | a    | 1
2  | b    | 2
3  | c    | 3
4  | d    | 4
If I insert item 4 between items 1 and 2, the order becomes:
id | item | order
1  | a    | 1
4  | d    | 1.5
2  | b    | 2
3  | c    | 3
In this way, every new order = (order[i-1] + order[i+1]) / 2.
If I need to save the preference for every user, then I need another relationship table like this:
user_id | item_id | order
1       | 1       | 1
1       | 2       | 2
1       | 3       | 3
1       | 4       | 1.5
I need num_of_users * num_of_items records to save this preference.
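To make the arithmetic concrete, a tiny sketch of the midpoint rule in plain Python (the helper name is invented for illustration):
def order_between(before, after):
    # a new row sits halfway between its neighbours;
    # None marks insertion at either end of the list
    if before is None:
        return after - 1.0
    if after is None:
        return before + 1.0
    return (before + after) / 2.0

print(order_between(1.0, 2.0))  # 1.5, as in the table above
One known limitation: repeatedly inserting into the same gap halves the interval each time, so after roughly 50 insertions the floating-point midpoints stop being distinct and the order column has to be renumbered.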
However, there's another solution I can think of.
Solution 2: Save the order preference in a column in the User table
This is straightforward: add a column to the User table to record the order. Each value would be parsed as an array of item_ids, ranked by the index of the array.
user_id | item_order
1       | [1,4,2,3]
2       | [1,2,3,4]
Is there any limitation to this solution? Or are there any other ways to solve this problem?
Usually, an explicit ordering deals with the presentation or some specific processing of data. Hence, it's a good idea to separate entities from their presentation/processing. For example:
users
-----
user_id (PK)
user_login
...
user_lists
----------
list_id, user_id (PK)
item_index
item_index can be a simple integer value:
- ordered continuously (1, 2, ... N): a DELETE/INSERT of the whole list is normally required to change the order
- ordered discretely with some seed (10, 20, ... N): you can insert new items without reordering the whole list
Another reason to separate entity data from lists: reordering a list should be done in a transaction, which may lead to row/table locks. With separate tables, only the data in the list table is affected.
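A small sketch of the seeded variant in plain Python (helper names invented; in SQL these would be UPDATEs of item_index):
SEED = 10

def renumber(item_ids):
    # rewrite the whole list with gaps of SEED between neighbours
    return {item_id: (i + 1) * SEED for i, item_id in enumerate(item_ids)}

def index_between(before, after):
    # slot a new item into the gap; None means the gap is exhausted
    # and the list needs renumbering
    mid = (before + after) // 2
    return mid if mid not in (before, after) else None

indexes = renumber(["a", "b", "c"])               # {'a': 10, 'b': 20, 'c': 30}
print(index_between(indexes["a"], indexes["b"]))  # 15: no reordering needed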

Best way to store results data in database? [duplicate]

This question already has answers here:
Is storing a delimited list in a database column really that bad?
(10 answers)
Closed 9 years ago.
I have results data like this:
1. account, name, #, etc
2. account, name, #, etc
...
10. account, name, #, etc
I have approximately 1 set of results data generated each week.
Currently it's stored like so:
DATETIME DATA_BLOB
Which is annoying because I can't query any of the data without parsing the BLOB into a custom object. I'm thinking of changing this.
I'm thinking of having one giant table:
DATETIME RANK ACCOUNT NAME NUMBER ... ETC
date1 1 user1 nn #
date1 2 user2 nn #
...
date1 10 userN nn #
date2 1 user5 nn #
date2 2 user12 nn #
...
date2 10 userX nn #
I don't know anything about database design principles, so can someone give me feedback on whether this is a good approach or there might be a better one?
Thanks
I think it is OK to have a table like that if there are no one-to-many relationships. If there are, it would be more efficient to have multiple tables, as in my example below. Here are some general tips as well:
Tip: Good practice - My professor told me that it's always good to have an "ID" column, which is a unique number identifier for each item in the table (1, 2, 3, etc.). (Perhaps that was the intent of your "Number" column.) SQLite in fact gives every ordinary table an implicit rowid column that plays this role.
Tip: Saving storage space - Also, if there is a one-to-many relationship (example: one name has many accounts), it might save space to have a separate table for the accounts that stores the ID of the name; that way you are storing many ints instead of duplicate strings.
Tip: Efficiency - Some databases have specific frameworks designed to handle relationships such as many-to-one or many-to-many, so if you use their framework for that (I don't remember exactly how to do it) it will probably work more efficiently.
Tip: Saving storage space - If you make your own ID column, it might be a waste if the database automatically includes an ID column anyway - so you might want to check for that possibility.
Conceptual Example: (Storing multiple accounts for the same name)
Poor Solution:
Storing everything in 1 table (inefficient, because it duplicates Bob's name, rank, and datetime):
ID NAME RANK DATETIME ACCOUNT
1 Bob 1 date1 bob_account_1
2 Joe 2 date2 user2_joe
3 Bob 1 date1 bob_account_2
4 Bob 1 date1 bobs_third_account
Better Solution: Having 2 tables to prevent duplicated information (Also demonstrates the usefulness of ID's). I named the 2 tables "Account" and "Name."
Table 1: "Account" (Note that NAME_ID refers to the ID column of Table 2)
ID NAME_ID ACCOUNT
1 1 bob_account_1
2 2 user2_joe
3 1 bob_account_2
4 1 bobs_third_account
Table 2: "Name"
ID NAME RANK DATETIME
1 Bob 1 date1
2 Joe 2 date2
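For concreteness, here is a compact sqlite3 sketch of that two-table layout, including the join that reassembles the flat view (table and column names follow the example above):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Name(
        id INTEGER PRIMARY KEY,
        name TEXT, rank INTEGER, datetime TEXT);
    CREATE TABLE Account(
        id INTEGER PRIMARY KEY,
        name_id INTEGER REFERENCES Name(id),
        account TEXT);
    INSERT INTO Name VALUES (1, 'Bob', 1, 'date1'), (2, 'Joe', 2, 'date2');
    INSERT INTO Account(name_id, account) VALUES
        (1, 'bob_account_1'), (2, 'user2_joe'),
        (1, 'bob_account_2'), (1, 'bobs_third_account');
""")

# each name, rank, and datetime is stored once; the join rebuilds the flat rows
for row in conn.execute("""
        SELECT n.name, n.rank, n.datetime, a.account
        FROM Account a JOIN Name n ON n.id = a.name_id"""):
    print(row)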
I'm not a database expert so this is just some of what I learned in my internet programming class. I hope this helps lead you in the right direction in further research.

Get "surrounding" rows in NHibernate query

I am looking for a way to retrieve the "surrounding" rows in an NHibernate query, given a primary key and a sort order.
E.g. I have a table with log entries and I want to display the entry with primary key 4242 and the previous 5 entries as well as the following 5 entries ordered by date (there is no direct relation between date and primary key). Such a query should return 11 rows in total (as long as we are not close to either end).
The log entry table can be huge, and retrieving everything to figure it out is not feasible.
Is there such a concept as a row number that can be used from within NHibernate? The underlying database will be either SQLite or Microsoft SQL Server.
Edit: added a sample.
Imagine data such as the following:
Id Time
4237 10:00
4238 10:00
1236 10:01
1237 10:01
1238 10:02
4239 10:03
4240 10:04
4241 10:04
4242 10:04 <-- requested "center" row
4243 10:04
4244 10:05
4245 10:06
4246 10:07
4247 10:08
When requesting the entry with primary key 4242 we should get the rows 1237, 1238 and 4239 to 4247. The order is by Time, Id.
Is it possible to retrieve the entries in a single query (which can of course include subqueries)? Time is a non-unique column, so several entries can have the same value, and in this example it is not possible to increase the resolution in a way that makes it unique!
"there is no direct relation between date and primary key" means, that the primary keys are not in a sequential order?
Then I would do it like this:
Item middleItem = Session.Get<Item>(id);

IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time))
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();

IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time))
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();
There is the risk of having several items with the same time.
Edit
This should work now.
Item middleItem = Session.Get<Item>(id);

IList<Item> previousFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Le("Time", middleItem.Time))          // less or equal
    .Add(Expression.Not(Expression.IdEq(middleItem.Id)))  // but not the middle item
    .AddOrder(Order.Desc("Time"))
    .SetMaxResults(5)
    .List<Item>();

IList<Item> nextFiveItems = Session.CreateCriteria(typeof(Item))
    .Add(Expression.Gt("Time", middleItem.Time))          // greater
    .AddOrder(Order.Asc("Time"))
    .SetMaxResults(5)
    .List<Item>();
This should be relatively easy with NHibernate's Criteria API:
IList<LogEntry> logEntries = session.CreateCriteria(typeof(LogEntry))
    .Add(Expression.InG<int>(Projections.Property("Id"), listOfIds))
    .AddOrder(Order.Desc("EntryDate"))
    .List<LogEntry>();
Here your listOfIds is just a strongly typed list of integers representing the ids of the entries you want to retrieve (integers 4242-5 through 4242+5).
Of course you could also add Expressions that let you retrieve Ids greater than 4242-5 and smaller than 4242+5.
Stefan's solution definitely works, but a better way exists using a single select and nested subqueries:
ICriteria crit = NHibernateSession.CreateCriteria(typeof(Item));

DetachedCriteria dcMiddleTime = DetachedCriteria.For(typeof(Item))
    .SetProjection(Property.ForName("Time"))
    .Add(Restrictions.Eq("Id", id));

DetachedCriteria dcAfterTime = DetachedCriteria.For(typeof(Item))
    .SetMaxResults(5)
    .SetProjection(Property.ForName("Id"))
    .Add(Subqueries.PropertyGt("Time", dcMiddleTime));

DetachedCriteria dcBeforeTime = DetachedCriteria.For(typeof(Item))
    .SetMaxResults(5)
    .SetProjection(Property.ForName("Id"))
    .Add(Subqueries.PropertyLt("Time", dcMiddleTime));

crit.AddOrder(Order.Asc("Time"));
crit.Add(Restrictions.Eq("Id", id)
    || Subqueries.PropertyIn("Id", dcAfterTime)
    || Subqueries.PropertyIn("Id", dcBeforeTime));

return crit.List<Item>();
This is NHibernate 2.0 syntax, but the same holds true for earlier versions, where you use Expression instead of Restrictions.
I have tested this in a test application and it works as advertised.
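As a footnote to the question's "is there such a concept as a row number": SQL Server has ROW_NUMBER(), and modern SQLite (3.25+) supports it too, so the surrounding-rows query can be expressed directly in raw SQL outside NHibernate. A minimal sketch with Python's sqlite3 and the sample data from the question:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LogEntry(id INTEGER PRIMARY KEY, time TEXT);
    INSERT INTO LogEntry VALUES
        (4237,'10:00'),(4238,'10:00'),(1236,'10:01'),(1237,'10:01'),
        (1238,'10:02'),(4239,'10:03'),(4240,'10:04'),(4241,'10:04'),
        (4242,'10:04'),(4243,'10:04'),(4244,'10:05'),(4245,'10:06'),
        (4246,'10:07'),(4247,'10:08');
""")

# number every row in (time, id) order, find the center row's number,
# then keep the rows within 5 positions of it
rows = conn.execute("""
    WITH numbered AS (
        SELECT id, time, ROW_NUMBER() OVER (ORDER BY time, id) AS rn
        FROM LogEntry)
    SELECT id, time FROM numbered
    WHERE rn BETWEEN (SELECT rn FROM numbered WHERE id = 4242) - 5
                 AND (SELECT rn FROM numbered WHERE id = 4242) + 5
    ORDER BY rn
""").fetchall()

print(rows)  # ids 1237, 1238 and 4239 through 4247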
