Django: efficient database search - database

I need an efficient way to search through my models to find a specific User, here's a list,
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!

Not sure if this method will solve your issue for all 4 cases, but it should at least help with the first one: querying user data efficiently.
I usually find the values or values_list queryset methods faster because they slim down the SELECT part of the actual SQL, so you get results faster. Django docs regarding this.
It's also worth mentioning that, starting with the new dev version, values and values_list can query across any type of relationship, including many-to-one.
Finally, you might find in_bulk useful. For a complex query, you could first fetch just the ids of the matching models using values or values_list, and then use in_bulk to get the model instances faster. Django docs about that.

Related

Recursive design in NoSQL

How could I express the following design?
There are two entities: user and group
Group can have users and other groups
User can't have other users or groups
Efficiently query any group and everything it contains
There are conceptually no depth limits (current hardware dictates it, e.g 5 for query speed)
Examples:
I need to use NoSQL and also be able to cache this data (Redis for example, which is NoSQL itself).
---
My current idea:
Every group is a single unit and only contains children (users and groups) IDs. Then I query all the children by IDs. If some of these also have children, I'll make another roundtrip and so on and on..
As you can imagine, this solution requires multiple queries and the amount increases with every "level of deepness". The good news is that I query all these items by ID which should be extremely fast.
Can anyone suggest a better way?
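The level-by-level lookup described above can be sketched roughly as follows. This is only an illustration under assumptions: the `STORE` dict and `fetch_many` helper are hypothetical stand-ins for the real NoSQL store and its batched get-by-ID call, and the record layout is made up. It shows that the number of round trips grows with depth, not with the number of items, because each level is fetched in one batch.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory stand-in for the NoSQL store: id -> record.
STORE: Dict[str, dict] = {
    "g1": {"type": "group", "children": ["u1", "g2"]},
    "g2": {"type": "group", "children": ["u2", "g3"]},
    "g3": {"type": "group", "children": ["u3", "u1"]},
    "u1": {"type": "user"},
    "u2": {"type": "user"},
    "u3": {"type": "user"},
}


def fetch_many(ids: List[str]) -> Dict[str, dict]:
    """Stand-in for one batched get-by-ID round trip to the store."""
    return {i: STORE[i] for i in ids if i in STORE}


def expand_group(root_id: str, max_depth: int = 5) -> Tuple[Set[str], int]:
    """Return (ids of everything inside the group, number of round trips)."""
    seen: Set[str] = {root_id}
    frontier = [root_id]
    round_trips = 0
    for _ in range(max_depth + 1):
        if not frontier:
            break
        records = fetch_many(frontier)  # one round trip per level
        round_trips += 1
        next_frontier = []
        for record in records.values():
            for child in record.get("children", []):
                if child not in seen:  # skip items already reached elsewhere
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    seen.discard(root_id)
    return seen, round_trips
```

The `seen` set also guards against a group that is (accidentally) nested inside itself, which would otherwise loop until `max_depth`.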
I would use a graph database as they are very powerful when dealing with this kind of queries.
Bear in mind that you won't be able to query the "parents" of a node though.
You could use Neo4j for this. They have a community edition that is free. https://neo4j.com/

Filtering Functionality Similar to Ebay SQL Count Issue

I am stuck on a database problem for a client and am wondering if someone could help me out. I am currently trying to implement filtering functionality so that a user can filter results after they have searched for something. We are using SQL Server 2008. I am working on an electronics e-commerce site and the database is quite large (500,000 plus records). The scenario is this - user goes to our website and types in 'laptop' and clicks search. This brings up the first page of several thousand results. What I want to do is then filter these results further and present the user with options such as:
Filter By Manufacturer
Dell (10,000)
Acer (2,000)
Lenovo (6,000)
Filter By Colour
Black (7000)
Silver (2000)
The main columns of the database are like this - the primary key is an integer ID
ID Title Manufacturer Colour
The key part of the question is how to get the counts in various categories in an efficient manner. The only way I currently know how to do it is with separate queries. However, should we wish to filter by further categories then this will become very slow - especially as the database grows. My current SQL is this:
select count(*) as ManufacturerCount, Manufacturer from [ProductDB.Product] GROUP BY Manufacturer;
select count(*) as ColourCount, Colour from [ProductDB.Product] GROUP BY Colour;
My question is whether I can get the results as a single table using some kind of join or union, and whether this would be faster than my current method of issuing multiple queries with the Count(*) function. Thanks for your help; if you require any further information please ask. PS I am wondering how sites like ebay and amazon manage to do this so fast. To understand my problem better, if you go onto ebay and type in laptop you will see a number of filters on the left - this is basically what I am trying to achieve. I don't know how it can be done efficiently when there are many filters. E.g. to get functionality equivalent to Ebay I would need about 10 queries and I'm sure that will be slow. I was thinking of creating an intermediate table with all the counts; however, the intermediate table would have to be continuously updated in order to reflect changes to the database, and that would be a problem if there are multiple updates per minute. Thanks.
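For reference, the "single table via union" idea from the question could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the sample rows and the Facet/Value/Total column names are made up for illustration. It returns all facet counts in one round trip; each branch still scans the matching rows, so whether it is actually faster than separate queries is something to measure rather than assume.

```python
import sqlite3

# In-memory SQLite database standing in for the real product database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product ("
    "ID INTEGER PRIMARY KEY, Title TEXT, Manufacturer TEXT, Colour TEXT)"
)
conn.executemany(
    "INSERT INTO Product (Title, Manufacturer, Colour) VALUES (?, ?, ?)",
    [
        ("Laptop A", "Dell", "Black"),
        ("Laptop B", "Dell", "Silver"),
        ("Laptop C", "Acer", "Black"),
        ("Laptop D", "Lenovo", "Black"),
    ],
)

# One statement, one result set: a row per (facet, value) pair.
FACET_SQL = """
SELECT 'Manufacturer' AS Facet, Manufacturer AS Value, COUNT(*) AS Total
FROM Product WHERE Title LIKE ? GROUP BY Manufacturer
UNION ALL
SELECT 'Colour', Colour, COUNT(*)
FROM Product WHERE Title LIKE ? GROUP BY Colour
ORDER BY Facet, Value
"""

rows = conn.execute(FACET_SQL, ("%laptop%", "%laptop%")).fetchall()
for facet, value, total in rows:
    print(f"{facet}: {value} ({total})")
```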
The "intermediate table" is exactly the way to go. I can guarantee you that no e-commerce site with substantial traffic and large number of products would do what you are suggesting on the fly at every inquiry.
If you are worried about keeping track of changes to products, just do all changes to the product catalog thru stored procs (my preferred method) or else use triggers.
One complication is how you will group things in the intermediate table. If you are only grouping on pre-defined categories and sub-categories that are built into the product hierarchy, then it's fairly easy. It sounds like you are allowing free-text search... if so, how will you manage multiple keywords that result in an unexpected intersection of different categories? One way is to save the keywords searched along with the counts and a time stamp. Then, the next time someone searches on the same keywords, check the intermediate table and if the time stamp is older than some predetermined threshold (say, 5 minutes), return your results to a temp table, query the category counts from the temp table, overwrite the previous counts with the new time stamp, and return the whole enchilada to the web app. Otherwise, skip the temp table and just return the pre-aggregated counts and data records. In this case, you might get some quirky front-end count behavior, like it might say "10 results" in a particular category but then when the user drills down, they actually find 9 or 11. It's happened to me on different sites as a customer and it's really not a big deal.
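The timestamp check described above might be sketched like this. It is a minimal illustration, not the actual setup from that company: the `_facet_cache` dict is a hypothetical stand-in for the intermediate table, `compute_counts` stands for whatever runs the expensive count queries, and normalizing the keyword order is an extra assumption.

```python
import time
from typing import Callable, Dict, Optional, Tuple

MAX_AGE_SECONDS = 5 * 60  # the "say, 5 minutes" threshold

# Hypothetical stand-in for the intermediate table:
# normalized keywords -> (time stamp, category counts).
_facet_cache: Dict[str, Tuple[float, Dict[str, int]]] = {}


def get_facet_counts(
    keywords: str,
    compute_counts: Callable[[str], Dict[str, int]],
    now: Optional[float] = None,
) -> Dict[str, int]:
    """Return cached counts if they are fresh enough, else recompute them."""
    now = time.time() if now is None else now
    # Assumption: "laptop black" and "Black laptop" should share one entry.
    key = " ".join(sorted(keywords.lower().split()))
    cached = _facet_cache.get(key)
    if cached is not None and now - cached[0] <= MAX_AGE_SECONDS:
        return cached[1]  # fresh: skip the expensive queries
    counts = compute_counts(keywords)  # stale or missing: recompute
    _facet_cache[key] = (now, counts)  # overwrite with the new time stamp
    return counts
```

Between refreshes the counts can be slightly off, which is the "10 results but actually 9 or 11" behaviour mentioned above.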
BTW, I used to work for a well-known e-commerce company and we did things like this.

Second Order Relationship in Graph Database

I'm creating an app which is quite relationship heavy. One of the features of the site is a recommendation feature, where users can rate things for others. For this, it seems like a Graph DB would be ideal so I am planning on using Neo4j, alongside Ruby.
This all seems fairly straight forward, however I would like to include a feature where users can rate a specific relationship. For example, a user could recommend a hotdog in a specific restaurant, etc. The only way I can really think about doing this with a Graph DB is to either add a 'joining node' between the two nodes, connecting all three, or by adding lists of properties to the relationship (ie adding hotdog_5 to the user-restaurant relationship). Obviously the rating could just be added to the hotdog-restaurant relationship, but you wouldn't be able to trace the users that rated it, to prevent them rating more than once.
Any thoughts on the problem would be appreciated.
You may want to retrieve all the comments from a user, the comments about hotdogs across all restaurants, or all the comments about every type of food in one restaurant, so I would recommend modelling it like this:
1. user-[:write]->comment
2. comment-[:about]->hotdog
3. comment-[:concern]->restaurant
4. restaurant-[:serve]->hotdog
I'm not sure about the last one; it may be redundant given 2 and 3. It depends a lot on the queries you'll run.
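To make the comment-as-a-node idea concrete, here is a small plain-Python sketch (no Neo4j involved). The `Comment` class and the helper names are made up for illustration; each `Comment` plays the role of the joining node that links a user, an item and a restaurant, which is what lets you both average ratings and detect a repeat rating from the same user.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Comment:
    """Joining node: user -> comment -> (item, restaurant)."""
    user: str
    item: str
    restaurant: str
    rating: int


comments: List[Comment] = []


def add_comment(user: str, item: str, restaurant: str, rating: int) -> bool:
    """Store a rating unless this user already rated this item here."""
    for c in comments:
        if (c.user, c.item, c.restaurant) == (user, item, restaurant):
            return False  # duplicate rating rejected
    comments.append(Comment(user, item, restaurant, rating))
    return True


def average_rating(item: str, restaurant: str) -> Optional[float]:
    """Average rating of one item in one restaurant, or None if unrated."""
    ratings = [
        c.rating for c in comments
        if c.item == item and c.restaurant == restaurant
    ]
    return sum(ratings) / len(ratings) if ratings else None
```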

Choosing the right model for storing and querying data?

I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models and I need help choosing the right one.
All the data is represented in two classes User.class and Word.class.
User: couple of string with user data (username, email.....)
Word: two strings
Which is better:
1. Search in 10 000 000 entities for the 100 I need. For instance, every Word entity has a string property owner and I query (owner = ‘John’).
2. In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query among 1000 users for the one I need and then call getWords(), which returns a List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?
The answer is to use appstats and you can find out:
AppStats
To keep your application fast, you need to know: Is your application making unnecessary RPC calls? Should it be caching data instead of making repeated RPC calls to get the same data? Will your application perform better if multiple requests are executed in parallel rather than serially?
Run some tests, try it both ways and see what appstats says.
But I'd say that your option 2) is better simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in app engine - CPU, datastore reads, datastore writes etc etc etc.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word class as a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
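As a rough mental model of the parent/ancestor idea (this is not the App Engine API, just an in-memory Python sketch with made-up helper names), you can picture each key as a path that starts with its parent's key. Looking up a user by ID is then a direct key lookup, and an ancestor query only touches keys that begin with that user's path.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, ...]  # e.g. ("User", "john", "Word", "w1")

# Hypothetical in-memory stand-in for the datastore: key path -> properties.
entities: Dict[Key, dict] = {}


def put(key: Key, props: dict) -> None:
    entities[key] = props


def get_by_key(key: Key) -> Optional[dict]:
    """Direct lookup by key; no query involved."""
    return entities.get(key)


def ancestor_query(kind: str, ancestor: Key) -> List[dict]:
    """All entities of `kind` whose key path starts with `ancestor`."""
    n = len(ancestor)
    return [
        props for key, props in entities.items()
        if len(key) > n and key[:n] == ancestor and key[-2] == kind
    ]


put(("User", "john"), {"email": "john@example.com"})
put(("User", "mary"), {"email": "mary@example.com"})
put(("User", "john", "Word", "w1"), {"a": "hello", "b": "hola"})
put(("User", "john", "Word", "w2"), {"a": "cat", "b": "gato"})
put(("User", "mary", "Word", "w1"), {"a": "dog", "b": "perro"})

johns_words = ancestor_query("Word", ("User", "john"))
```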
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, option 2 would be cheaper, since you'll need to fetch the user entity instead of running a query.
Option 2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5000 I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimited words per user.

How to best combine data from key-value stores and databases

Let's assume we have a friend list table for a social network.
Most use cases will require the friend list table to be JOINed to another table where you hold the personal details, such as: Name, Age, City, Profile picture URL, Last login time, etc...
Once the friend list table is in the 100M-row range, querying a JOIN like this can take a few seconds. If you introduce a few other WHERE conditions, it can be even slower.
A key-value store can bring in the friend list very quickly.
Let's assume we would like to show the 10 most recently logged in friends of a user.
What is the best way to calculate this output? A few methods I've been thinking about are below. Do any of them make sense?
Shall we keep all data in the key-value store environment? Update the key-value store with every new login?
Or shall we pull the friend list id's first. Then use a database command like "IN()" and query the database?
Merge the data at the client level? A javascript solution?
In your Users table you have a field that stores a timestamp for the last login. In the table where the friend relationships are stored you have one row per relationship, and that makes the table really long.
So joining these tables seems bad and we should optimize this process somehow? The answer is: No, not necessarily. The people who construct a DBMS have the same problems as you and they implement the tools to solve them. Every DBMS has some sort of query optimization which is smarter than you and me.
So there's no shame in joining long tables. If you want to try to optimize you may:
Get the IDs of the friends of the user.
Get the information you want of the first 10 friends sorted by last_login desc where the id fits (and other where conditions).
You don't need to join the tables, but you will use two queries, so maybe if your DBMS is smart a join is faster (Maybe run a test).
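Here is a small sketch of both variants so you can test them side by side. It uses Python's built-in sqlite3 module with made-up table and column names (users, friends, last_login); in the two-query version the first step could just as well read the friend IDs from a key-value store instead of the friends table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, last_login INTEGER);
CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
""")
# 20 users; a higher id means a more recent login in this sample data.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, f"user{i}", 1000 + i) for i in range(1, 21)],
)
# User 1 is friends with users 2..15.
conn.executemany(
    "INSERT INTO friends VALUES (?, ?)", [(1, i) for i in range(2, 16)]
)


def recent_friends_two_queries(user_id, limit=10):
    # Step 1: get the friend IDs (could come from a key-value store instead).
    ids = [row[0] for row in conn.execute(
        "SELECT friend_id FROM friends WHERE user_id = ?", (user_id,))]
    if not ids:
        return []
    # Step 2: fetch only those users, newest login first.
    placeholders = ",".join("?" * len(ids))
    sql = (
        f"SELECT id, name, last_login FROM users WHERE id IN ({placeholders}) "
        "ORDER BY last_login DESC LIMIT ?"
    )
    return conn.execute(sql, (*ids, limit)).fetchall()


def recent_friends_join(user_id, limit=10):
    # Same result in one query; let the optimizer decide how to join.
    sql = (
        "SELECT u.id, u.name, u.last_login FROM friends f "
        "JOIN users u ON u.id = f.friend_id "
        "WHERE f.user_id = ? ORDER BY u.last_login DESC LIMIT ?"
    )
    return conn.execute(sql, (user_id, limit)).fetchall()
```

Both functions return the same rows, so timing them against real data is a fair comparison.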
If you want to, you can use AJAX to load this data after the page has loaded. This improves the experience for the user, but the traffic on the DB will be the same.
I hope this helped.
Edit: Oh yeah, if you already knew the friends IDs (you need them for other stuff) you wouldn't even need a join. You can pass the IDs over to the javascript which loads the last login list later via AJAX.

Resources