Details
A Person has many Objectives.
Objectives have Person-specific details about Activitys.
An Activity contains generic information such as a world record.
A Person can organize an Event to attempt the Objective.
A Person invites other Persons to watch an Event with an Invitation.
Schema
Note: Only backrefs are listed on the example schema diagram, indicated by "(fk)". The arrows imply the normal relationship.
[Schema diagram: image link not preserved]
Question
I want most Event, Objective, and Activity details for all Invitations one Person received (regardless of status, but the status is still needed) displayed at once.
Is there a better way to represent the problem before I try tackling a JOIN like this? I believe the Person -> Invitation <- Event is an Association Object pattern, but I am unsure of how to get the Objective and Activity information in a clean, efficient manner for each Invitation returned.
Bonus: Provide sample SQLAlchemy query.
On the SQL side, this is pretty straightforward. I built some tables for testing using only one id number (for persons); all the rest of the keys were natural keys. Looking at the execution plan for this query
select I.*, A.activity_placeholder, E.event_location, O.objective_placeholder
from event_invitations I
inner join activity A
on (I.activity_name = A.activity_name)
inner join events E
on (I.personal_id = E.personal_id
and I.activity_name = E.activity_name
and I.objective_deadline = E.objective_deadline
and I.event_time = E.event_time)
inner join personal_objectives O
on (I.personal_id = O.personal_id
and I.activity_name = O.activity_name
and I.objective_deadline = O.objective_deadline)
where I.person_invited_id = 2;
shows that the dbms (PostgreSQL) is using indexes throughout except for a sequential scan on event_invitations. I'm sure that's because I used very little data, so all these tables easily fit into RAM. (When tables fit in RAM, it's often faster to scan a small table than to use the index.)
The optimizer estimates the cost for each part of the query to be 0.00, and you can't get much better than that. Actual run time was less than 0.2 milliseconds, but that doesn't mean much.
I'm sure you can translate this into SQLAlchemy. Let me know if you want me to post my tables and sample data.
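For anyone who wants to try the query above locally, here is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL or SQLAlchemy. The table and key column names are taken from the query; the column types, the status column, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name

# Hypothetical schema: natural keys as in the query above; types are guesses.
conn.executescript("""
CREATE TABLE activity (
    activity_name TEXT PRIMARY KEY,
    activity_placeholder TEXT
);
CREATE TABLE personal_objectives (
    personal_id INTEGER, activity_name TEXT, objective_deadline TEXT,
    objective_placeholder TEXT,
    PRIMARY KEY (personal_id, activity_name, objective_deadline)
);
CREATE TABLE events (
    personal_id INTEGER, activity_name TEXT, objective_deadline TEXT,
    event_time TEXT, event_location TEXT,
    PRIMARY KEY (personal_id, activity_name, objective_deadline, event_time)
);
CREATE TABLE event_invitations (
    personal_id INTEGER, activity_name TEXT, objective_deadline TEXT,
    event_time TEXT, person_invited_id INTEGER,
    status TEXT  -- assumed column; the question only says a status exists
);
INSERT INTO activity VALUES ('100m sprint', 'activity details');
INSERT INTO personal_objectives
    VALUES (1, '100m sprint', '2024-12-31', 'objective details');
INSERT INTO events
    VALUES (1, '100m sprint', '2024-12-31', '2024-06-01 10:00', 'city stadium');
INSERT INTO event_invitations
    VALUES (1, '100m sprint', '2024-12-31', '2024-06-01 10:00', 2, 'pending');
INSERT INTO event_invitations
    VALUES (1, '100m sprint', '2024-12-31', '2024-06-01 10:00', 3, 'accepted');
""")

# Same joins as the query above, with the invited person as a parameter.
INVITATION_SQL = """
SELECT I.*, A.activity_placeholder, E.event_location, O.objective_placeholder
FROM event_invitations I
JOIN activity A
  ON I.activity_name = A.activity_name
JOIN events E
  ON I.personal_id = E.personal_id
 AND I.activity_name = E.activity_name
 AND I.objective_deadline = E.objective_deadline
 AND I.event_time = E.event_time
JOIN personal_objectives O
  ON I.personal_id = O.personal_id
 AND I.activity_name = O.activity_name
 AND I.objective_deadline = O.objective_deadline
WHERE I.person_invited_id = ?
"""

def invitations_for(conn, person_id):
    """Return every invitation a person received, whatever its status."""
    return conn.execute(INVITATION_SQL, (person_id,)).fetchall()

for row in invitations_for(conn, 2):
    print(dict(row))
```

The status comes back as an ordinary column, so no filter on it is needed to get invitations "regardless of status".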
Related
I have a products table in Postgres that stores all the data I need. What I need to work out is the best way to do the following:
Each product has a different status on the way to completion - machined, painted, assembled, etc. For each status there is a letter that changes in the product id.
What would be the most efficient way of saving the data? For each status of the product, should there be 'another product' in the table? Or would join tables somewhere work?
Example:
111a1 for machined
111b1 for painted
Yet these are the same end product, just at different stages ...
It depends on what you want to be efficient: storage, ingestion, queries, maintainability...
Joins would work - you can join on some substring of the product id, so you need not have separate products for every stage of production.
But maintainability of your code is really important. Businesses change - perhaps to include sub-assemblies.
You might want to re-think this scheme of altering the product id to show the status. Product ID and work flow state are orthogonal concepts. So you probably want to have them in separate fields. You'll probably write far less code that way. The alternative will be becoming really well acquainted with substr() (depending on your SQL dialect), and all sorts of duplications elsewhere.
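To illustrate keeping the two concepts in separate fields, here is a minimal sketch using Python's built-in sqlite3 module (the question is about Postgres, but the idea carries over). The table layout, the base id '1111', and the exact list of statuses are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout: the id never changes, the workflow state is its own column.
conn.executescript("""
CREATE TABLE products (
    product_id TEXT PRIMARY KEY,
    status TEXT NOT NULL
        CHECK (status IN ('machined', 'painted', 'assembled'))
);
INSERT INTO products VALUES ('1111', 'machined');
""")

def advance(conn, product_id, new_status):
    """Move a product to a new stage without touching its id."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE products SET status = ? WHERE product_id = ?",
            (new_status, product_id),
        )

def status_of(conn, product_id):
    row = conn.execute(
        "SELECT status FROM products WHERE product_id = ?", (product_id,)
    ).fetchone()
    return row[0] if row else None

advance(conn, "1111", "painted")
print(status_of(conn, "1111"))
```

With this layout there is one row per product, and queries by stage are a plain WHERE on status instead of a substring match on the id.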
I use GAE NDB Python
Approach 1:
# both models below have similar properties (same number and type)
class X1(ndb.Model):
    p1 = ndb.StringProperty()
    ::
class X2(ndb.Model):
    p1 = ndb.StringProperty()
    ::
def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    record_list = []
    if q == "a":
        qry = X1.query(X1.p1 == w)
        record_list = qry.fetch()
    elif q == "b":
        qry = X2.query(X2.p1 == w)
        record_list = qry.fetch()
Approach 2:
class X1(ndb.Model):
    p1 = ndb.StringProperty()
    ::
def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    if q == "a":
        k = ndb.Key("type_1", "k1")
    elif q == "b":
        k = ndb.Key("type_2", "k1")
    # filters are positional, so they must come before the ancestor keyword
    qry = X1.query(X1.p1 == w, ancestor=k)
    record_list = qry.fetch()
My Questions:
Which approach is better in terms of query performance when I scale up the entities?
Would there be a significant impact on query performance if I scale up the ancestors (horizontally, at the same hierarchy level) to 10,000 or 100,000 in approach 2?
Is this application the correct use case for ancestors?
Context:
This project is for understanding GAE better, and the goal is to create an ecommerce website like amazon.com where I need to query based on many (about 10) filter conditions (such as price range, brand, screen size, and so on). Each filter condition has a few ranges (for example, there could be five price bands), and multiple ranges of a filter condition could be selected simultaneously. Multiple filter conditions could also be selected, just like in the left pane on amazon.com.
If I put all the filter conditions in the query as one AND/OR-connected expression, it would take a huge amount of time for scaled data sets, even if I use a query cursor and fetch by page.
To overcome this, I thought I would store the data in entities whose parent is a string. The parent would be a concatenation of the different filter options that the product matches. There would be a lot of redundancy, as I would store the same data in several entities, one for each combination of filter values that the product satisfies. The disadvantage of this approach is that each product's data is stored multiple times in different entities (much more storage), but I was hoping to get much better query performance (<2 seconds), since my query would then contain only one or two AND- or OR-connected elements apart from the ancestor. The ancestor would be the concatenation of the filter conditions that the user has selected to search for a product.
Please let me know if I am not clear. This is just an experimental approach that I am trying. Another approach would have been to cache the results periodically through a cron job.
Any other suggestions for achieving good query performance for such a website would be highly appreciated.
UPDATE(NEW STRATEGY):
I have decided to go with a model with some boolean properties (flags) for each range of each category (about 14 such properties per entity). For one category, which had two possible values, I have three models (one holding all entities with either of the two values, and the other two for entities with each value), so there is duplication (the same data could be stored twice in two entities).
Also, my complete product data model is a separate one. The above model contains a key to this complete model.
I could not do away with the Query class and write my own filtering (I actually did that with good success initially). The reason is that I need to fetch results page by page (~15 results), and I need to sort them too. If I fetch all results and apply my own filtering, then with a large data set fetching all results takes a huge amount of time because of the size of the results returned.
The initial development server results look good: query execution time is <3 seconds for ~6000 matched entities (though I wished it to be ~1 second). I need to scale up the production datastore to test there.
EDIT after context definition:
Tough subject there. You have plenty of datastore limitations that can get in your way :
Write throughput (1 write/sec per Entity Group)
Query inequality filters limit
Cross entity group transactions at write time (duplicating your product in each "query filter" specific entity group)
Max entity size (1MB) if you duplicate whole products for every "query filter" entity
I don't have any "ready made" answer, just some humble advice based on common sense.
In my opinion, your first solution will get overly complex as you add new filtering criteria, types of products, etc.
The problem with the datastore, and most "NoSQL" solutions, is that they tend to have few analytic/query features out of the box (they are not at the maturity level of RDBMS that have evolved for years), forcing you to compute results "by hand".
For your case, I don't see anything out of the box, and the "datastore query engine" is clearly not enough for such queries.
Keep your data quite simple though, just store your products as entities with properties.
If you have clearly different product categories, you may store them as different entity kinds -> I highly doubt people will run a "brand" query for both "shoes" and "food".
You will have to run a datastore query within the limitations to quickly get a gross result set, and refine it by hand (map reduce job, async task..) ... and then cache the result for as long as you can.
-> your aggressive cache solution looks far better from a performance, cost and maintainability standpoint.
You won't be able to cache your whole product base, and some queries for rarities will take longer... like I said, I don't see any perfect answers here, just different tradeoffs for performance.
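The "coarse query, refine by hand, then cache" idea can be sketched in plain Python. This is only an illustration of the shape, not datastore code: coarse_query stands in for a simple single-filter datastore query, functools.lru_cache stands in for whatever cache you would really use (memcache, for instance), and the product data and filter names are made up.

```python
from functools import lru_cache

# Made-up product data standing in for datastore entities.
PRODUCTS = (
    {"id": 1, "kind": "laptop", "brand": "Acme", "price": 450},
    {"id": 2, "kind": "laptop", "brand": "Acme", "price": 1200},
    {"id": 3, "kind": "laptop", "brand": "Globex", "price": 480},
    {"id": 4, "kind": "phone", "brand": "Acme", "price": 300},
    {"id": 5, "kind": "laptop", "brand": "Acme", "price": 700},
)

def coarse_query(kind):
    """Stand-in for a cheap datastore query with one equality filter."""
    return [p for p in PRODUCTS if p["kind"] == kind]

@lru_cache(maxsize=1024)
def search(kind, brands=frozenset(), price_bands=()):
    """Refine the coarse result in memory and cache the matching ids.

    brands is a frozenset and price_bands a tuple of (low, high) pairs
    so that the arguments are hashable and can serve as the cache key.
    An empty filter means "no restriction".
    """
    matches = []
    for product in coarse_query(kind):
        if brands and product["brand"] not in brands:
            continue
        if price_bands and not any(
            low <= product["price"] < high for low, high in price_bands
        ):
            continue
        matches.append(product["id"])
    return tuple(matches)

# Two price bands selected at once, one brand selected.
print(search("laptop", frozenset({"Acme"}), ((0, 500), (1000, 2000))))
```

Repeating the same filter combination is served from the cache; rare combinations still pay for the coarse query plus the in-memory pass, which is the tradeoff mentioned above.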
Just my 2 cents :) I'll be curious to see which solution you end up adopting.
You typically use ancestors for data that is owned by an entity.
For example :
A Book is your root entity, and it "owns" Page entities.
A Page without a Book is meaningless.
Book is the ancestor of Page.
A User is your root entity, and it "owns" BlogPost entities.
A BlogPost without its writer is quite meaningless.
User is the ancestor of BlogPost.
If your two entities X1 and X2 share the same attributes, I'd say they are the same X entity, with just an additional "type" attribute to determine whether you're talking about X type 1 or X type 2.
Let's assume we have a friend list table for a social network.
Most use cases will require the friend list table to be JOINed to another table where you hold the personal details, such as: Name, Age, City, Profile picture URL, Last login time, etc...
Once the friend list table is in the 100M-row range, querying a JOIN like this can take a few seconds. If you introduce a few other WHERE conditions, it can be even slower.
A key-value store system can bring in the friend list very quickly.
Let's assume we would like to show the 10 most recently logged in friends of a user.
What is the best way to calculate this output? A few methods I've been thinking about are below. Do any of them make sense?
Shall we keep all data in the key-value store environment? Update the key-value store with every new login?
Or shall we pull the friend list id's first. Then use a database command like "IN()" and query the database?
Merge the data at the client level? A javascript solution?
In your Users table you have a field to save a timestamp for the last login. In the table where the friend relationships are stored you have one row per relationship, and that makes the table really long.
So joining these tables seems bad and we should optimize this process somehow? The answer is: No, not necessarily. The people who construct a DBMS have the same problems as you and they implement the tools to solve them. Every DBMS has some sort of query optimization which is smarter than you and me.
So there's no shame in joining long tables. If you want to try to optimize you may:
Get the IDs of the friends of the user.
Get the information you want of the first 10 friends sorted by last_login desc where the id fits (and other where conditions).
You don't need to join the tables, but you will use two queries, so maybe if your DBMS is smart a join is faster (Maybe run a test).
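Both variants can be sketched with Python's built-in sqlite3 module so you can run the test yourself. The table and column names (users, friendships, last_login) and the sample data are assumptions; only the two-step idea and the join come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, last_login INTEGER);
CREATE TABLE friendships (user_id INTEGER, friend_id INTEGER);
""")
# 20 users; a higher id means a more recent login in this toy data.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, f"user{i}", 1000 + i) for i in range(1, 21)],
)
# User 1 is friends with users 2..15.
conn.executemany(
    "INSERT INTO friendships VALUES (1, ?)", [(i,) for i in range(2, 16)]
)

def recent_friends_two_queries(conn, user_id, limit=10):
    """Step 1: fetch friend ids. Step 2: fetch the most recent logins via IN()."""
    friend_ids = [
        row[0]
        for row in conn.execute(
            "SELECT friend_id FROM friendships WHERE user_id = ?", (user_id,)
        )
    ]
    if not friend_ids:
        return []
    # One placeholder per id; databases cap the number of bound parameters,
    # so a very long friend list would need batching.
    placeholders = ",".join("?" * len(friend_ids))
    sql = (
        f"SELECT id, name, last_login FROM users WHERE id IN ({placeholders}) "
        "ORDER BY last_login DESC LIMIT ?"
    )
    return conn.execute(sql, (*friend_ids, limit)).fetchall()

def recent_friends_join(conn, user_id, limit=10):
    """Same result with a single join, leaving the plan to the optimizer."""
    return conn.execute(
        "SELECT u.id, u.name, u.last_login "
        "FROM friendships f JOIN users u ON u.id = f.friend_id "
        "WHERE f.user_id = ? ORDER BY u.last_login DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()

print(recent_friends_two_queries(conn, 1))
```

Timing both functions on realistic data is the test suggested above; this toy data set is far too small to say which one wins.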
If you want to, you can use AJAX to load this data after the page has loaded. This improves the experience for the user, but the traffic on the DB will be the same.
I hope this helped.
Edit: Oh yeah, if you already knew the friends IDs (you need them for other stuff) you wouldn't even need a join. You can pass the IDs over to the javascript which loads the last login list later via AJAX.
I am building a book review application.
In the Review model, I have book_id and several fields like author_rating and/or scary_rating.
In the Book model, I have a search() function that I'd like to use to search for books with certain characteristics, like an author_rating of above 5, for example.
What is the best way to accomplish this? I know these are probably wrong, but I could add the attributes (author_rating, scary_rating, etc) into the book model and update them (average them in) each time a review is submitted; or, I could run a cron task that updates those fields every so often.
But is there a better way where I could query both the Book and Review models to come up with Books and meet certain criteria defined by looking at the Reviews database?
Does that make sense?
You could use UNION to combine two queries:
query 1......
union
query 2.....
order by ratings LIMIT 1
That is the general format. If you want a real-life example, I'm willing to give you one:
SELECT
u.username, u.picture,m.id, m.user_note, m.reply_id, m.reply_name, m.dt
FROM
relationships r,
notes m,
user u
WHERE
m.user_id = r.leader
AND
r.leader = u.user_id
AND
r.listener = '$user_id'
UNION
select username, picture,id, user_note, reply_id, reply_name, dt
from user u, notes b
where u.user_id = b.user_id
and
b.user_id ='$user_id'
ORDER BY dt DESC LIMIT 10;
It sounds like you have a bunch of reviews and you want to average them on the fly when someone searches, is this correct? If so, I think it would probably work better to save the averaged ratings as fields in the Book model.
The reason I say this is that it is likely your app will be doing more searching than saving reviews, and it's probably more important to have searches returned quickly than reviews saved quickly.
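Here is a rough sketch of what "save the averaged ratings as fields in the Book model" could look like, written with Python's built-in sqlite3 module instead of CakePHP and MySQL. The column names author_rating_avg and review_count, and the sample books, are assumptions; the running-average update is one possible way to keep the field current on each new review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author_rating_avg REAL NOT NULL DEFAULT 0,  -- denormalized average
    review_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE reviews (
    id INTEGER PRIMARY KEY,
    book_id INTEGER REFERENCES books(id),
    author_rating REAL
);
INSERT INTO books (id, title) VALUES (1, 'Book A');
INSERT INTO books (id, title) VALUES (2, 'Book B');
""")

def add_review(conn, book_id, author_rating):
    """Store the review and fold its rating into the book's running average."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO reviews (book_id, author_rating) VALUES (?, ?)",
            (book_id, author_rating),
        )
        # SET expressions see the row's old values, so this is
        # new_avg = (old_avg * old_count + rating) / (old_count + 1).
        conn.execute(
            "UPDATE books SET "
            "author_rating_avg = "
            "(author_rating_avg * review_count + ?) / (review_count + 1), "
            "review_count = review_count + 1 "
            "WHERE id = ?",
            (author_rating, book_id),
        )

def books_rated_above(conn, threshold):
    """The search only touches the books table; no join or AVG() at read time."""
    rows = conn.execute(
        "SELECT title FROM books WHERE author_rating_avg > ? ORDER BY title",
        (threshold,),
    )
    return [title for (title,) in rows]

add_review(conn, 1, 8)
add_review(conn, 1, 6)
add_review(conn, 2, 4)
print(books_rated_above(conn, 5))
```

The write path does a little more work per review, which matches the reasoning above that searches matter more than saves.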
However, if you want to calculate these averages on the fly, you can use MySQL's AVG() function. Check this SO post for an example that I think is pretty close to your situation:
MySQL - Can I combine these 2 SQL statements? Combine JOIN and AVG?
Also, this is a situation where I'd probably just write the SQL and stick it into a query(), rather than trying to wrestle with Cake's ORM syntax:
http://book.cakephp.org/view/456/query
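For the on-the-fly variant, a JOIN with GROUP BY and HAVING on the average is the usual shape. Below is a minimal sketch using Python's built-in sqlite3 module rather than MySQL and Cake; AVG(), GROUP BY and HAVING are standard SQL, while the table layout and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE reviews (
    id INTEGER PRIMARY KEY,
    book_id INTEGER REFERENCES books(id),
    author_rating REAL
);
INSERT INTO books VALUES (1, 'Book A');
INSERT INTO books VALUES (2, 'Book B');
INSERT INTO books VALUES (3, 'Book C');
INSERT INTO reviews (book_id, author_rating) VALUES (1, 8);
INSERT INTO reviews (book_id, author_rating) VALUES (1, 6);
INSERT INTO reviews (book_id, author_rating) VALUES (2, 4);
INSERT INTO reviews (book_id, author_rating) VALUES (2, 5);
INSERT INTO reviews (book_id, author_rating) VALUES (3, 9);
""")

def books_with_avg_above(conn, threshold):
    """Average each book's reviews at query time and keep those above threshold."""
    return conn.execute(
        """
        SELECT b.id, b.title, AVG(r.author_rating) AS avg_author_rating
        FROM books b
        JOIN reviews r ON r.book_id = b.id
        GROUP BY b.id, b.title
        HAVING AVG(r.author_rating) > ?
        ORDER BY avg_author_rating DESC
        """,
        (threshold,),
    ).fetchall()

print(books_with_avg_above(conn, 5))
```

Because this is an inner join, books without any reviews never appear in the result; that may or may not be what a search page wants.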
I'm currently hand-writing a DAL in C# with SqlDataReader and stored procedures. Performance is important, but it still should be maintainable...
Let's say there's a table recipes
(recipeID, author, timeNeeded, yummyFactor, ...)
and a table ingredients
(recipeID, name, amount, yummyContributionFactor, ...)
Now I'd like to query like 200 recipes with their ingredients. I see the following possibilities:
Query all recipes, then query the ingredients for each recipe.
This would of course result in maaany queries.
Query all recipes and their ingredients in a big joined list. This will cause a lot of useless traffic, because every recipe data will be transmitted multiple times.
Query all recipes, then query all the ingredients at once by passing the list of recipeIDs back to the database. Alternatively, issue both queries at once and return multiple resultsets. Back in the DAL, associate the ingredients with the recipes by their recipeID.
Exotic way: Cursor through all recipes and return, for each recipe, two separate resultsets for recipe and ingredients. Is there a limit on the number of resultsets?
For more variety, the recipes can be selected by a list of IDs from the DAL or by some parametrized SQL condition.
Which one do you think has the best performance/mess ratio?
If you only need to join two tables and an "ingredient" isn't a huge amount of data, the best balance of performance and maintainability is likely to be a single joined query. Yes, you are repeating some data in the results, but unless you have 100,000 rows and it's overloading the database server/network, it's too soon to be optimizing.
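As a rough illustration of the single joined query, here is a sketch in Python with the built-in sqlite3 module (not C# with SqlDataReader, but the read loop has the same shape). The recipes and ingredients columns follow the question; the column types, sample rows, and the load_recipes helper are assumptions.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recipes (recipeID INTEGER PRIMARY KEY, author TEXT);
CREATE TABLE ingredients (recipeID INTEGER, name TEXT, amount INTEGER);
INSERT INTO recipes VALUES (1, 'alice');
INSERT INTO recipes VALUES (2, 'bob');
INSERT INTO recipes VALUES (3, 'carol');
INSERT INTO ingredients VALUES (1, 'flour', 200);
INSERT INTO ingredients VALUES (1, 'sugar', 50);
INSERT INTO ingredients VALUES (2, 'eggs', 3);
""")

def load_recipes(conn):
    """One joined query; recipe columns repeat on every ingredient row."""
    rows = conn.execute(
        """
        SELECT r.recipeID, r.author, i.name, i.amount
        FROM recipes r
        LEFT JOIN ingredients i ON i.recipeID = r.recipeID
        ORDER BY r.recipeID, i.name
        """
    ).fetchall()
    recipes = []
    # Rows are sorted by recipeID, so consecutive rows belong to one recipe.
    for recipe_id, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        recipes.append(
            {
                "recipeID": recipe_id,
                "author": group[0][1],
                # A recipe without ingredients yields one row with NULLs.
                "ingredients": [
                    (name, amount)
                    for _, _, name, amount in group
                    if name is not None
                ],
            }
        )
    return recipes

for recipe in load_recipes(conn):
    print(recipe)
```

The LEFT JOIN keeps recipes that have no ingredients yet; with an inner join they would silently drop out of the list.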
The story is a little bit different if you have many layers of joins each with decreasing cardinality. For example, in one of my apps I have something like the following:
Event -> EventType   -> EventCategory
                     -> EventPriority
      -> EventSource -> EventSourceType -> Vendor
A query like this results in a significant amount of duplication which is unacceptable when there are 100k events to retrieve, 1000 event types, maybe 10 categories/priorities, 50 sources, and 5 vendors. So in that case, I have a stored procedure that returns multiple result sets:
All 100k Events with just EventTypeID
The 1000 EventTypes with CategoryID, PriorityID, etc. that apply to these Events
The 10 EventCategories and EventPriorities that apply to the above EventTypes
The 50 EventSources that generated the 100k events
And so on, you get the idea.
Because the cardinality goes down so drastically, it is much quicker to download only what is needed here and use a few dictionaries on the client side to piece it together (if that is even necessary). In some cases the low-cardinality data may even be cached in memory and never retrieved from the database at all (except on app start or when the data is changed).
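The client-side piecing together with dictionaries can be sketched like this in Python. The three lists stand in for result sets returned by the stored procedure; their column layout and the sample values are hypothetical.

```python
# Hypothetical result sets, as they might come back from one stored procedure.
events = [(1, 10), (2, 10), (3, 11)]            # (EventID, EventTypeID)
event_types = [(10, "Login", 100), (11, "Crash", 101)]  # (EventTypeID, name, CategoryID)
categories = [(100, "Security"), (101, "Stability")]    # (CategoryID, name)

# Build the small lookup tables once. These are the low-cardinality sets,
# so they are cheap to hold in memory or even to cache across requests.
category_by_id = dict(categories)
type_by_id = {
    type_id: (name, category_by_id[category_id])
    for type_id, name, category_id in event_types
}

def resolve(events):
    """Attach type and category names to each event via dictionary lookups."""
    return [(event_id, *type_by_id[type_id]) for event_id, type_id in events]

for row in resolve(events):
    print(row)
```

Each event row carries only an id for its type, so the type and category names cross the wire once instead of once per event.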
The determining factors in using an approach such as this are a very high number of results and a steep decrease in cardinality for the joins, in other words fanning in. This is actually the reverse of most usages and probably the reverse of what you are doing here. If you are selecting "recipes" and joining to "ingredients", you are probably fanning out, which can make this approach wasteful, especially if there are only two tables to join.
So I'm just putting it out there that this is a possible alternative if performance becomes an issue down the road; at this point in your design, before you have real-world performance data, I would simply go the route of using a single joined result set.
The best performance/mess ratio is 42.
On a more serious note, go with the simplest solution: retrieve everything with a single query. Don't optimize before you encounter a performance issue. "Premature optimization is the root of all evil" :)
One stored proc that returns 2 datasets: "recipe header" and "recipe details"?
This is what I'd do if I needed the data all at once in one go. If I don't need it in one go, I'd still get 2 datasets but with less data.
We've found it slightly easier to work with this in the client rather than one big query as Andomar suggested, but his/her answer is still very valid.
I would look at the bigger picture - do you really need to retrieve ingredients for 200 recipes? What happens when you have 2,000?
For example, if this is in a web page I would have the 200 recipes listed (if not less because of paging), and when the user clicked on one to see the ingredient then I would get the ingredients from the database.
If this isn't doable, I would have 1 stored proc that returns one DataSet containing 2 tables. One with the recipes and the second with the list of ingredients.
"I'm currently hand-writing a DAL in C#..." As a side note, you might want to check out the post: Generate Data Access Layer Methods From Stored Procs. It can save you a lot of time.