I am building a book review application.
In the Review model, I have book_id and several fields like author_rating and/or scary_rating.
In the Book model, I have a search() function that I'd like to use to search for books with certain characteristics, like an author_rating of above 5, for example.
What is the best way to accomplish this? I know these are probably wrong, but I could add the attributes (author_rating, scary_rating, etc) into the book model and update them (average them in) each time a review is submitted; or, I could run a cron task that updates those fields every so often.
But is there a better way, where I could query both the Book and Review models to come up with Books that meet certain criteria defined by looking at the Reviews table?
Does that make sense?
You could use UNION to combine two queries, in this kind of format:
query 1
UNION
query 2
ORDER BY ratings LIMIT 1
Here is a real-life example:
SELECT u.username, u.picture, m.id, m.user_note, m.reply_id, m.reply_name, m.dt
FROM relationships r
JOIN notes m ON m.user_id = r.leader
JOIN user u ON r.leader = u.user_id
WHERE r.listener = '$user_id'
UNION
SELECT u.username, u.picture, b.id, b.user_note, b.reply_id, b.reply_name, b.dt
FROM user u
JOIN notes b ON u.user_id = b.user_id
WHERE b.user_id = '$user_id'
ORDER BY dt DESC LIMIT 10
It sounds like you have a bunch of reviews and you want to average them on the fly when someone searches. Is that correct? If so, I think it would probably work better to save the averaged ratings as fields in the Book model.
The reason I say this is that it is likely your app will be doing more searching than saving reviews, and it's probably more important to have searches returned quickly than reviews saved quickly.
However, if you want to calculate these averages on the fly, you can use MySQL's AVG() function. Check this SO post for an example that I think is pretty close to your situation:
MySQL - Can I combine these 2 SQL statements? Combine JOIN and AVG?
Also, this is a situation where I'd probably just write the SQL and stick it into a query(), rather than trying to wrestle with Cake's ORM syntax:
http://book.cakephp.org/view/456/query
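For what it's worth, here is a minimal sketch of that AVG()-on-the-fly approach, run against an in-memory SQLite database; the table and column names are assumed from the question, and the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE reviews (id INTEGER PRIMARY KEY, book_id INTEGER,
                      author_rating REAL, scary_rating REAL);
INSERT INTO books VALUES (1, 'Dracula'), (2, 'Emma');
INSERT INTO reviews (book_id, author_rating, scary_rating) VALUES
    (1, 8, 9), (1, 6, 7), (2, 4, 1);
""")
# Join books to their reviews, average the ratings per book, and keep only
# books whose average author_rating is above 5 (HAVING filters aggregates).
rows = cur.execute("""
SELECT b.id, b.title, AVG(r.author_rating) AS avg_author_rating
FROM books b
JOIN reviews r ON r.book_id = b.id
GROUP BY b.id, b.title
HAVING AVG(r.author_rating) > 5
""").fetchall()
print(rows)  # -> [(1, 'Dracula', 7.0)]
```

The same query string could be dropped into Cake's query() as the answer suggests.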
I am stuck on a database problem for a client, wondering if someone could help me out. I am currently trying to implement filtering functionality so that a user can filter results after they have searched for something. We are using SQL Server 2008. I am working on an electronics e-commerce site and the database is quite large (500,000-plus records). The scenario is this: a user goes to our website, types in 'laptop', and clicks search. This brings up the first page of several thousand results. What I want to do is then
filter these results further and present the user with options such as:
Filter By Manufacturer
Dell (10,000)
Acer (2,000)
Lenovo (6,000)
Filter By Colour
Black (7000)
Silver (2000)
The main columns of the database are like this - the primary key is an integer ID
ID Title Manufacturer Colour
The key part of the question is how to get the counts in various categories in an efficient manner. The only way I currently know how to do it is with separate queries. However, should we wish to filter by further categories then this will become very slow - especially as the database grows. My current SQL is this:
select count(*) as ManufacturerCount, Manufacturer from [ProductDB.Product] GROUP BY Manufacturer;
select count(*) as ColourCount, Colour from [ProductDB.Product] GROUP BY Colour;
My question is whether I can get the results as a single table using some kind of join or union, and whether this would be faster than my current method of issuing multiple queries with the COUNT(*) function. Thanks for your help; if you require any further information, please ask. PS: I am wondering how sites like eBay and Amazon manage to do this so fast. To understand my problem better, go onto eBay and type in laptop: you will
see a number of filters on the left, which is basically what I am trying to achieve. I don't know how it can be done efficiently when there are many filters. E.g., to get functionality equivalent to eBay I would need about 10 queries, and I'm sure that will be slow. I was thinking of creating an intermediate table with all the counts; however, the intermediate table would have to be continuously updated in order to reflect changes to the database, and that would be a problem if there are multiple updates per minute. Thanks.
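For reference, the single-result-set form asked about can be sketched with UNION ALL and a literal tag column that keeps the manufacturer rows distinguishable from the colour rows; a minimal sketch against in-memory SQLite (simplified table name, invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Product (ID INTEGER PRIMARY KEY, Title TEXT,
                      Manufacturer TEXT, Colour TEXT);
INSERT INTO Product (Title, Manufacturer, Colour) VALUES
    ('Laptop A', 'Dell', 'Black'),
    ('Laptop B', 'Dell', 'Silver'),
    ('Laptop C', 'Acer', 'Black');
""")
# Stack the two GROUP BY results with UNION ALL; the Facet column tags
# which grouping each row came from.
rows = cur.execute("""
SELECT 'Manufacturer' AS Facet, Manufacturer AS Value, COUNT(*) AS N
FROM Product GROUP BY Manufacturer
UNION ALL
SELECT 'Colour', Colour, COUNT(*)
FROM Product GROUP BY Colour
ORDER BY Facet, Value
""").fetchall()
print(rows)
```

This is still two scans of the table under the hood, so it mainly tidies the round trips rather than the work; the answer below addresses the performance side.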
The "intermediate table" is exactly the way to go. I can guarantee you that no e-commerce site with substantial traffic and a large number of products would do what you are suggesting on the fly at every inquiry.
If you are worried about keeping track of changes to products, just do all changes to the product catalog through stored procs (my preferred method) or else use triggers.
One complication is how you will group things in the intermediate table. If you are only grouping on pre-defined categories and sub-categories that are built into the product hierarchy, then it's fairly easy. It sounds like you are allowing free-text search, though; if so, how will you manage multiple keywords that result in an unexpected intersection of different categories?
One way is to save the keywords searched along with the counts and a timestamp. Then, the next time someone searches on the same keywords, check the intermediate table: if the timestamp is older than some predetermined threshold (say, 5 minutes), return your results to a temp table, query the category counts from the temp table, overwrite the previous counts with the new timestamp, and return the whole enchilada to the web app. Otherwise, skip the temp table and just return the pre-aggregated counts and data records.
In this case, you might get some quirky front-end count behavior: it might say "10 results" in a particular category, but when the user drills down, they actually find 9 or 11. It's happened to me on different sites as a customer, and it's really not a big deal.
BTW, I used to work for a well-known e-commerce company and we did things like this.
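A minimal sketch of that staleness check, using an in-process Python dict in place of the intermediate table (all names invented; `compute_counts` stands in for whatever runs the real GROUP BY aggregation):

```python
import time

CACHE_TTL = 300  # the "5 minutes" threshold from the text, in seconds
facet_cache = {}  # keyword -> (timestamp, counts)

def facet_counts(keyword, compute_counts):
    """Return facet counts for a search keyword, recomputing when stale."""
    entry = facet_cache.get(keyword)
    now = time.time()
    if entry is None or now - entry[0] > CACHE_TTL:
        counts = compute_counts(keyword)  # the expensive aggregation queries
        facet_cache[keyword] = (now, counts)
        return counts
    return entry[1]  # fresh enough: serve the pre-aggregated counts
```

A second search on the same keyword within the threshold never touches the expensive path, which is where the quirky-but-harmless count drift described above comes from.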
I have a products table in Postgres that stores all the data I need. What I need to work out is the best way to handle this:
Each product has a different status on the way to completion: machined, painted, assembled, etc. For each status, there is a letter that changes in the product id.
What would be the most efficient way of saving the data? Should there be 'another product' in the table for each status of the product? Or would join tables somewhere work?
Example:
111a1 for machined
111b1 for painted
Yet these are the same end product, just at different stages ...
It depends on what you want to be efficient: storage, ingestion, queries, maintainability...
Joins would work - you can join on some substring of the product id, so you need not have separate products for every stage of production.
But maintainability of your code is really important. Businesses change - perhaps to include sub-assemblies.
You might want to re-think this scheme of altering the product id to show the status. Product ID and work flow state are orthogonal concepts. So you probably want to have them in separate fields. You'll probably write far less code that way. The alternative will be becoming really well acquainted with substr() (depending on your SQL dialect), and all sorts of duplications elsewhere.
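A minimal sketch of that separation, using in-memory SQLite as a stand-in for Postgres (column names invented): the product keeps one stable id, and the work-flow state is just an updatable column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# One row per physical product; the work-flow state lives in its own column
# instead of being encoded into the product id.
cur.executescript("""
CREATE TABLE products (
    product_id TEXT PRIMARY KEY,  -- stable id: no letter changing per stage
    status     TEXT NOT NULL
        CHECK (status IN ('machined', 'painted', 'assembled'))
);
INSERT INTO products VALUES ('1111', 'machined');
UPDATE products SET status = 'painted' WHERE product_id = '1111';
""")
rows = cur.execute("SELECT product_id, status FROM products").fetchall()
print(rows)  # -> [('1111', 'painted')]
```

Moving a product to the next stage is a one-column UPDATE, and no substr() parsing is ever needed to find out where a product is in the flow.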
Let's assume we have a friend list table for a social network.
Most use cases will require the friend list table to be JOINed to another table where you hold the personal details, such as: Name, Age, City, Profile picture URL, Last login time, etc...
Once the friend list table is in the 100M-row range, querying a JOIN like this can take a few seconds. If you introduce a few other WHERE conditions, it can be even slower.
A key-value store can bring in the friend list very quickly.
Let's assume we would like to show the 10 most recently logged in friends of a user.
What is the best way to calculate this output? A few methods I've been thinking about are below. Do any of them make sense?
Shall we keep all data in the key-value store environment, and update the key-value store with every new login?
Or shall we pull the friend list IDs first, then use a database command like IN() to query the database?
Merge the data at the client level? A javascript solution?
In your Users table you have a field that saves a timestamp for the last login. In the table where the friend relationships are stored, you have one row per relationship, and that makes the table really long.
So joining these tables seems bad and we should optimize this process somehow? The answer is: No, not necessarily. The people who construct a DBMS have the same problems as you and they implement the tools to solve them. Every DBMS has some sort of query optimization which is smarter than you and me.
So there's no shame in joining long tables. If you want to try to optimize you may:
Get the IDs of the friends of the user.
Get the information you want of the first 10 friends sorted by last_login desc where the id fits (and other where conditions).
You don't need to join the tables, but you will use two queries, so maybe if your DBMS is smart a join is faster (Maybe run a test).
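A sketch of that two-query approach against in-memory SQLite (schema and data invented): first pull the friend IDs, then fetch details with IN (...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, last_login INTEGER);
CREATE TABLE friendships (user_id INTEGER, friend_id INTEGER);
INSERT INTO users VALUES (1, 'me', 100), (2, 'ann', 300),
                         (3, 'bob', 200), (4, 'eve', 400);
INSERT INTO friendships VALUES (1, 2), (1, 3), (1, 4);
""")

# Query 1: the friend ids (these might equally come from a key-value store).
friend_ids = [fid for (fid,) in cur.execute(
    "SELECT friend_id FROM friendships WHERE user_id = ?", (1,))]

# Query 2: details of the 10 most recently logged-in friends via IN (...).
placeholders = ", ".join("?" * len(friend_ids))
rows = cur.execute(
    "SELECT name, last_login FROM users "
    "WHERE user_id IN (%s) ORDER BY last_login DESC LIMIT 10" % placeholders,
    friend_ids,
).fetchall()
print(rows)  # -> [('eve', 400), ('ann', 300), ('bob', 200)]
```

The placeholder list keeps the IN() query parameterized rather than string-built from user data.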
If you want to, you can use AJAX to load this data after the page has loaded; this improves the experience for the user, but the traffic on the DB will be the same.
I hope this helped.
Edit: Oh yeah, if you already knew the friends IDs (you need them for other stuff) you wouldn't even need a join. You can pass the IDs over to the javascript which loads the last login list later via AJAX.
I need an efficient way to search through my models to find a specific User. Here's a list:
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
Not sure if this method will solve your issue for all 4 cases, but at least it should help you out in the first one: querying user data efficiently.
I usually find the values or values_list queryset methods faster, because they slim down the SELECT part of the actual SQL, so you get results back sooner. See the Django docs regarding this.
Also worth mentioning: starting with the current dev version, within values and values_list you can query any type of relationship, including many-to-one.
And finally, you might find in_bulk useful as well. For a complex query, you might try querying the ids of some models first using values or values_list and then use in_bulk to get the model instances faster. See the Django docs about that.
Details
A Person has many Objectives.
Objectives have Person-specific details about Activitys.
An Activity contains generic information such as a world record.
A Person can organize an Event to attempt the Objective.
A Person invites other Persons to watch an Event with an Invitation.
Schema
Note: Only backref's are listed on the example schema diagram, indicated by "(fk)". The arrows imply the normal relationship.
Image Link Until I Get 10 Points To Use Image Tag
Question
I want most Event, Objective, and Activity details for all Invitations one Person received (regardless of status, though the status is still needed) displayed at once.
Is there a better way to represent the problem before I try tackling a JOIN like this? I believe the Person -> Invitation <- Event is an Association Object pattern, but I am unsure of how to get the Objective and Activity information in a clean, efficient manner for each Invitation returned.
Bonus: Provide sample SQLAlchemy query.
On the SQL side, this is pretty straightforward. I built some tables for testing using only one id number (for persons); all the rest of the keys were natural keys. Looking at the execution plan for this query
select I.*, A.activity_placeholder, E.event_location, O.objective_placeholder
from event_invitations I
inner join activity A
on (I.activity_name = A.activity_name)
inner join events E
on (I.personal_id = E.personal_id
and I.activity_name = E.activity_name
and I.objective_deadline = E.objective_deadline
and I.event_time = E.event_time)
inner join personal_objectives O
on (I.personal_id = O.personal_id
and I.activity_name = O.activity_name
and I.objective_deadline = O.objective_deadline)
where I.person_invited_id = 2;
shows that the dbms (PostgreSQL) is using indexes throughout except for a sequential scan on event_invitations. I'm sure that's because I used very little data, so all these tables easily fit into RAM. (When tables fit in RAM, it's often faster to scan a small table than to use the index.)
The optimizer estimates the cost for each part of the query to be 0.00, and you can't get much better than that. Actual run time was less than 0.2 milliseconds, but that doesn't mean much.
I'm sure you can translate this into SQLAlchemy. Let me know if you want me to post my tables and sample data.
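For the bonus, here is a possible SQLAlchemy Core translation of the SQL above. It assumes SQLAlchemy 1.4+ and the column names implied by the query; the String/Integer types are guessed and the key constraints are omitted, so treat it as a sketch rather than the real schema.

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select, and_)

metadata = MetaData()

inv = Table(
    "event_invitations", metadata,
    Column("person_invited_id", Integer),
    Column("personal_id", Integer),
    Column("activity_name", String),
    Column("objective_deadline", String),
    Column("event_time", String),
)
activity = Table(
    "activity", metadata,
    Column("activity_name", String),
    Column("activity_placeholder", String),
)
events = Table(
    "events", metadata,
    Column("personal_id", Integer),
    Column("activity_name", String),
    Column("objective_deadline", String),
    Column("event_time", String),
    Column("event_location", String),
)
objectives = Table(
    "personal_objectives", metadata,
    Column("personal_id", Integer),
    Column("activity_name", String),
    Column("objective_deadline", String),
    Column("objective_placeholder", String),
)

# Mirror the three inner joins from the SQL query above.
stmt = (
    select(inv, activity.c.activity_placeholder,
           events.c.event_location, objectives.c.objective_placeholder)
    .select_from(
        inv
        .join(activity, inv.c.activity_name == activity.c.activity_name)
        .join(events, and_(inv.c.personal_id == events.c.personal_id,
                           inv.c.activity_name == events.c.activity_name,
                           inv.c.objective_deadline == events.c.objective_deadline,
                           inv.c.event_time == events.c.event_time))
        .join(objectives, and_(inv.c.personal_id == objectives.c.personal_id,
                               inv.c.activity_name == objectives.c.activity_name,
                               inv.c.objective_deadline == objectives.c.objective_deadline))
    )
    .where(inv.c.person_invited_id == 2)
)

# Smoke-test against an empty in-memory SQLite database with one invented row.
engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)
with engine.connect() as conn:
    conn.execute(inv.insert().values(
        person_invited_id=2, personal_id=1, activity_name="running",
        objective_deadline="2015-06-01", event_time="2015-06-05 10:00"))
    conn.execute(activity.insert().values(
        activity_name="running", activity_placeholder="world record 2:02:57"))
    conn.execute(events.insert().values(
        personal_id=1, activity_name="running", objective_deadline="2015-06-01",
        event_time="2015-06-05 10:00", event_location="Berlin"))
    conn.execute(objectives.insert().values(
        personal_id=1, activity_name="running", objective_deadline="2015-06-01",
        objective_placeholder="beat 4:00"))
    rows = conn.execute(stmt).fetchall()
print(rows)
```

With the ORM and mapped relationships the same statement could be written with joinedload() to pull the Objective and Activity details per Invitation in one round trip.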