Query datastore for the set of property values present - google-app-engine

I have a property column which can have a subset of the following values at any point in time: { a | b | c | d | e }. By this, I mean that sometimes it can be any of { a | d | e }, or at another time it can even be { x | y | z }. How do I query the datastore, so that I can find out what subset is present at that point in time, without having to dig into each entity?
Presently I'm doing it this way:
people = Person.all().fetch(100)
city = set()
for p in people:
    city.add(p.address)
I want to get the set of property values that are present at this point in time (i.e. no duplicates). For example, at one point in time all 5,000,000 people have an address of { Manila | Cebu | Davao }, then I want the set(Manila, Cebu, Davao).
At another point in time, all 5,000,000 people will have an address of { Iloilo | Laoag }, then I want the set(Iloilo, Laoag).
Before any query, I would not know what the set consists of.
My present method requires digging through all the entities, which is terribly inefficient. Is there a better way?

In App Engine, it's almost always better to generate and store what you might need at write time.
So in your use case, every time you add or edit a Person entity, you also add the city they are in to a separate entity that lists all the cities, and then store that entity as well.
class Cities(db.Model):
    # A JSON-encoded list of city names, stored as text.
    list_of_cities = db.TextProperty(default="[]")

# When creating or editing a person:
person = Person(city=city)
cities = Cities.all().get()  # there is only one Cities entity
if cities is None:
    cities = Cities()  # first write: create the single entity
list_of_cities = simplejson.loads(cities.list_of_cities)
if city not in list_of_cities:
    list_of_cities.append(city)  # add to the list of cities
    cities.list_of_cities = simplejson.dumps(list_of_cities)
    db.put(cities)
person.put()
You may want to use memcache on your cities entity to speed things up a bit. If you are also expecting to add more than one person in bursts of more than 1 write / second, then you might need to also consider sharding your list of cities.

An alternative to the approach suggested by Albert is to compute these values periodically using a mapreduce. The App Engine Mapreduce library makes this fairly straightforward. Your mapper will output the city (for instance) for each record, while the reducer will output the value and the number of times it occurs for each.
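The mapper/reducer split described above can be illustrated with a minimal plain-Python sketch. This does not use the App Engine Mapreduce library's API; the function names and the dictionary-based records are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_person(person):
    """Mapper: emit one (value, 1) pair per record."""
    yield person["address"], 1


def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each distinct value."""
    counts = defaultdict(int)
    for value, n in pairs:
        counts[value] += n
    return dict(counts)


def distinct_values(people):
    """Run the mapper over every record and feed the pairs to the reducer."""
    pairs = (pair for person in people for pair in map_person(person))
    return reduce_counts(pairs)


# Hypothetical records standing in for Person entities.
people = [
    {"address": "Manila"},
    {"address": "Cebu"},
    {"address": "Manila"},
    {"address": "Davao"},
]
counts = distinct_values(people)
```

The keys of `counts` are the set of values currently present, and the values are how often each occurs.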


Nested FutureBuilder vs nested calls for lazy loading from database

I need to choose the better of two approaches.
I have a Flutter app that uses sqflite to save data. The database has two tables:
Employee:
+-------------+-----------------+------+
| employee_id | employee_name |dep_id|
+-------------+-----------------+------+
| e12 | Ada Lovelace | dep1 |
+-------------+-----------------+------+
| e22 | Albert Einstein | dep2 |
+-------------+-----------------+------+
| e82 | Grace Hopper | dep3 |
+-------------+-----------------+------+
SQL:
CREATE TABLE Employee(
employee_id TEXT NOT NULL PRIMARY KEY,
employee_name TEXT NOT NULL ,
dep_id TEXT,
FOREIGN KEY(dep_id) REFERENCES Department(dep_id)
ON DELETE SET NULL
);
Department:
+--------+-----------+-------+
| dep_id | dep_title |dep_num|
+--------+-----------+-------+
| dep1 | Math | dep1 |
+--------+-----------+-------+
| dep2 | Physics | dep2 |
+--------+-----------+-------+
| dep3 | Computer | dep3 |
+--------+-----------+-------+
SQL:
CREATE TABLE Department(
dep_id TEXT NOT NULL PRIMARY KEY,
dep_title TEXT NOT NULL ,
dep_num INTEGER
);
I need to show a ListGrid of the departments that are referenced in the Employee table. I look at the Employee table and fetch the department ids from it. That part is easy, but after fetching each dep_id I need to build a card for it, so I need information from the Department table.
The complete information for the ids fetched from the Employee table is in the Department table.
There are thousands of rows in each table.
I have a database helper class to connect to the database :
DbHelper is something like this:
Future<List<String>> getDepartmentIds() async {
  // fetch all dep_id values from the Employee table
}

Future<Department> getDepartment(String id) async {
  // fetch the Department row for a specific id from the Department table
}

Future<List<Department>> getEmployeeDepartments() async {
  // 1. fetch all dep_id values from the Employee table
  // 2. for each id, fetch the Department record from the Department table
  final ids = await getDepartmentIds();
  final List<Department> deps = [];
  // A plain for loop awaits each call; forEach with an async callback would not.
  for (final id in ids) {
    deps.add(await getDepartment(id));
  }
  return deps;
}
There are two approaches:
First one:
Define a function in DbHelper that returns all dep_id values from the Employee table (getDepartmentIds) and another function that returns a Department object (model) for a specific id (getDepartment).
This requires two nested FutureBuilders, one for fetching the ids and the other for fetching the Department model.
Second one:
Define a single function that first fetches the ids and then maps each id to a Department model (getEmployeeDepartments).
This requires only one FutureBuilder.
Which one is better?
Should I let the FutureBuilders handle it, or should I push the work into DbHelper?
If I use the first approach then, as far as I can tell, I have to put the second future call (getDepartment, which fetches a Department object by its id) in the build function, and that is not recommended.
The problem with the second one is that it makes a lot of nested calls in DbHelper.
I used ListView.builder for performance.
I checked both with some data but couldn't figure out which one is better. I guess it depends both on flutter and sqlite(sqflite).
Which one is better, or is there a better approach?

Given that I don't see much code in this example, I'll give a high-level answer to your questions.
Evaluate Approach One
Right off the bat this part sticks out: "returns all dep_id from Employee table"
I would say scratch that, since "return all" is typically never a good solution, especially since you mention your tables have a lot of rows.
Evaluate Approach Two
I'm not sure how the performance of this differs from the first approach; it seems bad for the same reasons. I think this one just changes your UI logic a bit, that's all.
Typical 'Endless' List Approach
You would do a query on the Employees table with a join to the Departments table.
You would implement Pagination on your UI and pass in your values to the query from step one.
At a basic level you'll need these variables: Take, Skip, HasMore
Take: The count # of items to request each query
Skip: The count # of items to skip on the next query, this will be the size of the number of items you currently have in your List in memory driving your UI.
HasMore: You can set this on the response of each query, to let the UI know if there are still more items or not.
As you scroll down the list, when you get to the bottom, you will request more items .
Initially issue a query for example: Take: 10, Skip: 0
Next query when you hit the bottom of the UI: Take: 10, Skip: 10
etc..
Example sql query:
SELECT *
FROM Employee E
JOIN Department D ON D.dep_id = E.dep_id
ORDER BY E.employee_name
LIMIT {TAKE#} OFFSET {SKIP#}
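To make the Take / Skip / HasMore idea concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than sqflite. The `fetch_page` helper is hypothetical; it asks for one row more than the page size so it can report whether another page exists. The table and column names follow the schema in the question.

```python
import sqlite3


def fetch_page(conn, take, skip):
    """Return (rows, has_more) for one page of employees joined to departments."""
    # Request one extra row so we can tell whether another page exists.
    rows = conn.execute(
        """
        SELECT E.employee_id, E.employee_name, D.dep_title
        FROM Employee E
        JOIN Department D ON D.dep_id = E.dep_id
        ORDER BY E.employee_name
        LIMIT ? OFFSET ?
        """,
        (take + 1, skip),
    ).fetchall()
    has_more = len(rows) > take
    return rows[:take], has_more


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Department(dep_id TEXT PRIMARY KEY, dep_title TEXT NOT NULL);
    CREATE TABLE Employee(
        employee_id TEXT PRIMARY KEY,
        employee_name TEXT NOT NULL,
        dep_id TEXT REFERENCES Department(dep_id)
    );
    INSERT INTO Department VALUES
        ('dep1', 'Math'), ('dep2', 'Physics'), ('dep3', 'Computer');
    INSERT INTO Employee VALUES
        ('e12', 'Ada Lovelace', 'dep1'),
        ('e22', 'Albert Einstein', 'dep2'),
        ('e82', 'Grace Hopper', 'dep3');
    """
)

# First page: Take 2, Skip 0. Next page: Skip = number of rows already loaded.
first_page, has_more = fetch_page(conn, take=2, skip=0)
second_page, has_more_after = fetch_page(conn, take=2, skip=len(first_page))
```

In a Flutter list, the same shape would be driven by the scroll position: request the next page when the user reaches the bottom, and stop once `has_more` is false.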
Hopefully this helps. I'm not fully sure what you're actually trying to do in terms of code.

As far as I can tell, what you're looking to do is get a list of employees with relevant info including department.
If that's the case, then it's tailor made for INNER JOIN. Something like this:
SELECT Employee.*, Department.dep_id, Department.dep_title
FROM Employee INNER JOIN Department
ON Employee.dep_id = Department.dep_id;
(although you may want to double check that, my SQL is a bit rusty).
This would do what you need in one step. However, there is still the issue of what you're asking which seems to be "Is it more efficient to do many small requests or one big one, and what are the performance ramifications".
The answer to that is a bit specific to Flutter. When you make a request with sqflite, it processes whatever you've passed to it and sends it to Java/Objective-C, which may do more processing and push the work to a background thread. That thread calls into the SQLite library, which parses the request and then actually reads the data from disk to perform the operation. The result returns to the Java/Objective-C layer, which pushes the response to the UI thread, which in turn responds back to Dart.
If that doesn't sound particularly efficient, that's because it isn't =D. If you're doing this a few times (or even a few hundred) it's probably fine, but if you're getting into thousands as you state it might start slowing down.
The alternative you've proposed is to do one large request. You will know better than I whether that is wise; if it's a couple thousand rows but only ever a couple thousand, and the data you're returning is always going to be relatively small (i.e. just a 10-20 character name and department name), then you'll have about (20+20)*2000 = 80,000 bytes = 80 kB of data. Even if you assume the overhead doubles that size, 160 kB of data shouldn't be enough to faze any relatively recent smartphone (after all, that's much smaller than any single photo!).
Now, taking some domain specific knowledge, you could optimize this. For example, if you know the number of departments is much smaller than employees (i.e. < 100 or something), you could skip the entire issue of doing joins, and simply request all departments before this begins and put it in a map (dep_id => dep_title), and then once you've requested employees you could just simply do that lookup from dep_id to dep_title yourself. That way your requests wouldn't have to include the dep_title over and over again.
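As a rough illustration of that lookup-map idea, here is a sketch using Python's built-in sqlite3 module instead of sqflite. The variable names are assumptions; the tables follow the question's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Department(dep_id TEXT PRIMARY KEY, dep_title TEXT NOT NULL);
    CREATE TABLE Employee(
        employee_id TEXT PRIMARY KEY,
        employee_name TEXT NOT NULL,
        dep_id TEXT
    );
    INSERT INTO Department VALUES
        ('dep1', 'Math'), ('dep2', 'Physics'), ('dep3', 'Computer');
    INSERT INTO Employee VALUES
        ('e12', 'Ada Lovelace', 'dep1'),
        ('e22', 'Albert Einstein', 'dep2'),
        ('e82', 'Grace Hopper', 'dep3');
    """
)

# Load the small table once into a dict: dep_id -> dep_title.
dep_titles = dict(conn.execute("SELECT dep_id, dep_title FROM Department"))

# Fetch employees without a join and resolve the title in memory.
employees = conn.execute(
    "SELECT employee_name, dep_id FROM Employee ORDER BY employee_name"
).fetchall()
cards = [(name, dep_titles.get(dep_id)) for name, dep_id in employees]
```

Using `dict.get` means an employee whose dep_id is NULL or unknown simply gets `None` as the title instead of raising an error.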
That being said, you may want to consider paging the employee lookup whether or not you use a join. You'd do this by requesting 100 employees (or whatever number) at a time rather than the entire batch - that way you don't have the overhead of 1000+ calls through the stack, but you also don't have a large block of data all in memory all at once.
SELECT * FROM Employee
WHERE employee_name >= LastValue
ORDER BY employee_name
LIMIT 100;
Unfortunately, that doesn't fit in as well with how Flutter builds lists, so you'd probably need something like an 'EmployeeDatabaseManager' that does the actual requests, and your list would call into it to get the data. That's probably beyond the scope of this question though.
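One detail worth sketching for the "continue from the last value" paging query: if the comparison includes the last value, the final row of the previous page comes back again, and if names are not unique a strict comparison on the name alone can skip rows. A common way around both is to add the primary key as a tie-breaker. The following is a minimal sketch with Python's sqlite3 module; the `fetch_after` helper is hypothetical.

```python
import sqlite3


def fetch_after(conn, last_name, last_id, limit):
    """Fetch the next `limit` employees after the (name, id) of the last row seen."""
    return conn.execute(
        """
        SELECT employee_id, employee_name FROM Employee
        WHERE employee_name > ?
           OR (employee_name = ? AND employee_id > ?)
        ORDER BY employee_name, employee_id
        LIMIT ?
        """,
        (last_name, last_name, last_id, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee(employee_id TEXT PRIMARY KEY, employee_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("e12", "Ada Lovelace"), ("e22", "Albert Einstein"), ("e82", "Grace Hopper")],
)

# Start with empty strings, which sort before any non-empty name or id.
page1 = fetch_after(conn, "", "", 2)
last_id, last_name = page1[-1]
page2 = fetch_after(conn, last_name, last_id, 2)
```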

Work around the SQLite parameter limit in a select query

I have a GUI application with a list of people which contains the person's database id and their attributes. Something like this:
+----+------+
| ID | Name |
+----+------+
| 1 | John |
| 2 | Fred |
| 3 | Mary |
[...]
This list can be filtered, so the number and selection of people shown vary over time. To get a list of Peewee Person objects, I first get the list of visible IDs and use the following query:
ids = [row[0] for row in store]
Person.select().where(Person.id.in_(ids))
Which in turn translates to the following SQL:
('SELECT "t1"."id", "t1"."name" FROM "person" AS "t1" WHERE ("t1"."id" IN (?, ?, ?, ...))', [1, 2, 3, ...])
This throws an OperationalError: too many SQL variables error on Windows with more than 1000 people. This is documented in the Peewee and SQLite docs. Workarounds given online usually relate to bulk inserts and ways to split the action in chunks. Is there any way to work around this limitation with the mentioned SELECT ... WHERE ... IN query?
Getting the separate objects in a list comprehension is too slow:
people = [Person.get_by_id(row[0]) for row in store]
Maybe split the list of IDs into chunks of at most 1000 items, run the select query on each chunk, and then combine the results somehow?

Where are the IDs coming from? The best answer is to avoid using that many parameters, of course. For example, if your list of IDs could be represented as a query of some sort, then you can just write a subquery, e.g.
my_friends = (Relationship
              .select(Relationship.to_user)
              .where(Relationship.from_user == me))
tweets_by_friends = Tweet.select().where(Tweet.user.in_(my_friends))
In the above, we could get all the user IDs from the first query and pass them en-masse as a list into the second query. But since the first query ("all my friends") is itself a query, we can just compose them. You could also use a JOIN instead of a subquery, but hopefully you get the point.
If this is not possible and you seriously have a list of >1000 IDs...how is such a list useful in a GUI application? Over 1000 anything is quite a lot of things.
To try and answer the question you asked -- you'll have to chunk them up. Which is fine. Just:
def get_users(list_of_user_ids):
    user_ids = list_of_user_ids
    accum = []
    # 100 at a time.
    for i in range(0, len(user_ids), 100):
        query = User.select().where(User.id.in_(user_ids[i:i+100]))
        accum.extend(query)
    return accum
But seriously, I think there's a problem with the way you're implementing this that makes it even necessary to filter on so many ids.
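The chunking approach above can also be tried without Peewee. This is a minimal sketch against Python's built-in sqlite3 module; the `select_people_by_ids` helper and the chunk size of 500 are assumptions, chosen only to stay under the variable limit mentioned in the question.

```python
import sqlite3


def select_people_by_ids(conn, ids, chunk_size=500):
    """Run one SELECT ... IN (...) per chunk and combine the rows."""
    people = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # One "?" placeholder per id in this chunk.
        placeholders = ", ".join("?" * len(chunk))
        query = f"SELECT id, name FROM person WHERE id IN ({placeholders})"
        people.extend(conn.execute(query, chunk).fetchall())
    return people


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(i, f"person{i}") for i in range(1, 2501)],
)

ids = list(range(1, 2501))
people = select_people_by_ids(conn, ids)
```

Only the placeholders are interpolated into the SQL string; the ids themselves are still passed as bound parameters.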

Add label to node from a CSV file in NEO4J

I am trying to add some nodes to my graph database from a CSV file, which looks something like this:
| city continent feature_1 ...
|--------------------------------------------------
0 | Barcelona Europe
1 | Stockholm Europe
2 | New York America
3 | Nairobi Africa
4 | Tokyo Asia
The first approach was to simply load this data as:
// Insert city nodes
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///city_data.csv' AS row
MERGE (city: City {name: row.city})
Next step was to incorporate the continent information, so I could have nodes of different colors. This means having two labels for each node, which is something I am not sure how to do. Anyway, for the moment I decided to simply have one label instead, which contained the continent information. Since this information is within the CSV file I believe apoc.create.node tool is the way to go. Hence, inspired by How to use apoc.load.csv in conjunction with apoc.create.node I tried the following:
CALL apoc.load.csv('file:///city_data.csv') YIELD row
CALL apoc.create.node(['row.continent'], {name:['row.continent']}) YIELD node
RETURN count(*)
This does not raise any error, but it does something different from what I was thinking of. It basically sets the column name ("row.continent") itself as the label...

The problem is that you wrapped the variable in quotes, which turns it into a string literal. Try this instead:
CALL apoc.load.csv('file:///city_data.csv') YIELD map AS row
CALL apoc.create.node([row.continent], {name: row.continent}) YIELD node
RETURN count(*)

Is there a database for reverse lookups that stores constraints in its documents rather than applying them at query time?

I have a problem in one of my projects. I have events that should apply to some users, but not all. The rule that decides whether an event applies to a user is stored in the event's document itself.
An example:
--------------------------------------------
| event | constraints |
--------------------------------------------
| kill | age > 67 |
| give_cake | latestGrade == 'A' |
--------------------------------------------
Now if I supply a sample to the database like this one it should return both documents.
{ "age": 80, "latestGrade": "A" }
For this one it should only match one row:
{ "age": 80 }
I know that this is rather specific, and I know that I could just program this myself by iterating through all the documents and applying the constraints.
I'm looking for a technology that works similarly or that can be applied to this kind of problem. If you don't know of one, could you give me an idea of how this can be solved in a way that is better than iterating through all records (maybe by compiling the constraints in some way)? The constraints should also be combinable (AND/OR and parentheses).
Right now we use Redis, Elasticsearch and PostgreSQL in the projects, so maybe there is some funky functionality inside these technologies already that I'm not aware of?
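For reference, the brute-force baseline mentioned above (checking every stored constraint against the supplied sample) might look like the following minimal sketch. The tuple-based constraint encoding and the function names are assumptions made for illustration, not an existing database feature.

```python
import operator

# Supported comparison operators for leaf constraints.
OPS = {">": operator.gt, "==": operator.eq}

# Hypothetical encoding: a leaf is (field, op, value);
# a combination is ("and" | "or", [child constraints]).
events = {
    "kill": ("age", ">", 67),
    "give_cake": ("latestGrade", "==", "A"),
}


def matches(constraint, sample):
    """Evaluate one constraint tree against a sample document."""
    if constraint[0] in ("and", "or"):
        results = (matches(child, sample) for child in constraint[1])
        return all(results) if constraint[0] == "and" else any(results)
    field, op, value = constraint
    # A field missing from the sample never satisfies a constraint.
    return field in sample and OPS[op](sample[field], value)


def matching_events(sample):
    """Return the names of all events whose constraint the sample satisfies."""
    return sorted(name for name, c in events.items() if matches(c, sample))


both = matching_events({"age": 80, "latestGrade": "A"})
only_age = matching_events({"age": 80})
```

This still visits every stored constraint, so it only pins down the intended matching semantics; it does not address the scaling concern.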

What's the most effective way of storing this data?

Need help figuring out a good way to store data effectively and efficiently
I'm using Parse (JavaScript SDK), here's an example of what I'm trying to store
Predictions of football (soccer) matches so an example of one match would be;
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 predicts the score will be Team A 2-0 Team B -> so 2-0
User456 predicts the score will be Team A 1-3 Team B -> so 1-3
Each event has information attached to it like an eventId, several categories, start time, end time, a result and more
I need to record a score prediction per user for each event (usually 10 events at a time so a lot of predictions will be coming in)
I need to store these so I can cross reference the correct result against the user's prediction and award points based on their prediction, the teams in the match and the categories of the event but instead of adding to a total I need all the awarded points stored separately per category and per user so I can then filter based on predictions between set dates and certain categories e.g.
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 prediction = 2-0
Actual result = 2-0
So now I need to award X points to User123 for Team A, Team B, "League-1", and "Sunday-League" and record it to the event date too.

I would suggest you create a table for games, a table for users, and an associative table that links them. This is a pretty standard many-to-many relationship.
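As a rough relational sketch of that layout, here is an example using Python's built-in sqlite3 module. The table and column names are assumptions; with Parse, each table would presumably correspond to a class, but that mapping is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users(user_id TEXT PRIMARY KEY);
    CREATE TABLE games(event_id TEXT PRIMARY KEY, result TEXT);
    -- Associative table: one row per (user, game) prediction.
    CREATE TABLE predictions(
        user_id TEXT NOT NULL REFERENCES users(user_id),
        event_id TEXT NOT NULL REFERENCES games(event_id),
        predicted_score TEXT NOT NULL,
        PRIMARY KEY (user_id, event_id)
    );
    INSERT INTO users VALUES ('User123'), ('User456');
    INSERT INTO games VALUES ('abc', '2-0');
    INSERT INTO predictions VALUES
        ('User123', 'abc', '2-0'),
        ('User456', 'abc', '1-3');
    """
)

# Cross-reference predictions against the actual result of each game.
correct = conn.execute(
    """
    SELECT p.user_id
    FROM predictions p
    JOIN games g ON g.event_id = p.event_id
    WHERE p.predicted_score = g.result
    """
).fetchall()
```

Per-category points could be stored the same way, in a further table keyed by user, event, and category, so they can later be filtered by date range and category.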
